Introduction

We walk through examples of how our Full House Model era-adjusts batting average (BA) using the parametric method, and how it era-adjusts bWAR (Baseball-Reference Wins Above Replacement), fWAR (FanGraphs Wins Above Replacement), HR (home runs), and BB (walks) for batters, and bWAR, fWAR, ERA (earned run average), and SO (strikeouts) for pitchers using the nonparametric method. We also describe how the tables and figures in the paper and supplementary materials are produced.

We first load the relevant software packages and the data. The data are collected from Baseball-Reference, FanGraphs, and the Chadwick repository on GitHub. Details of the data collection are in the Supplementary Materials. The datasets used in this technical report can be found in Tech_report_data.

For batters, each row of data consists of a player ID, a year ID, a name, an age, a league ID (lgID), the recorded numbers of at bats (AB), games (G), and plate appearances (PA), a park-factored batting average (BA), walks (BB), observed hits (obs_hits), observed home runs (obs_HR), hit-by-pitches (HBP), sacrifice bunts (SH), sacrifice flies (SF), park-factored hits (H), park-factored home runs (HR), Baseball-Reference WAR (bWAR), FanGraphs WAR (fWAR), and the size of the talent pool (pops).

For pitchers, each row of data consists of a player ID, a year ID, a name, an age, a league ID (lgID), a team ID (teamID), the recorded numbers of innings pitched (IP) and games (G), observed earned runs (obs_ER), observed home runs (obs_HR), observed hits (obs_H), strikeouts (SO), park-factored earned runs (ER), park-factored home runs (HR_PF), park-factored hits (H_PF), walks (BB), hit-by-pitches (HBP), Baseball-Reference WAR (bWAR), FanGraphs WAR (fWAR), and the size of the talent pool (pops).

rm(list=ls())
library(tidyverse)
library(orderstats)
library(Pareto)
library(doParallel)
library(splines)
library(retrosheet)
library(kableExtra)
library(Lahman)
ncores <- detectCores() - 1
years <- 1871:2023
load("pop_data.RData")
load("bat_dat.RData")
load("pit_dat.RData")
source('source.R')

In this analysis we assume, for the parametric method, that BA \(\stackrel{i i d}{\sim} N\left(\mu_{i}, \sigma_{i}\right)\), and that talent scores \(\stackrel{i i d}{\sim} \operatorname{Pareto}(\alpha)\) where \(\alpha=1.16\). For the nonparametric method we make no assumptions about the distribution of the baseball statistics. This choice of \(\alpha\) corresponds to the Pareto principle, which is casually referred to as the \(80 / 20\) rule.
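The link between \(\alpha\) and the 80/20 rule can be checked numerically. The sketch below relies on a standard property of the Pareto distribution (for shape \(\alpha > 1\), the top fraction \(p\) holds a share \(p^{1 - 1/\alpha}\) of the total); it is not taken from source.R, and the helper name top_share is our own.

```r
# Share of the total held by the top fraction p under Pareto(alpha), alpha > 1.
# top_share is a hypothetical helper, not part of source.R.
top_share <- function(p, alpha) {
  stopifnot(alpha > 1, p > 0, p <= 1)
  p^(1 - 1 / alpha)
}

# Share held by the top 20% when alpha = 1.16 (close to 0.80)
share_116 <- top_share(0.20, alpha = 1.16)

# The shape that gives exactly 80/20 is log(5) / log(4), roughly 1.161
alpha_exact <- log(5) / log(4)
share_exact <- top_share(0.20, alpha = alpha_exact)
```

With \(\alpha = 1.16\) the top 20% of the talent pool accounts for roughly 80% of the total, which is why this value is used as a rounded stand-in for the exact shape.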

Batters

In this section, our Full House Model era-adjusts batting statistics such as bWAR, fWAR, HR, BB, and BA using a nonparametric distribution for the components. We also apply our Full House Model to era-adjust BA using a parametric distribution for the components.

bWAR

We start with bWAR for batters and first select the full-time batters. After screening out individuals with fewer than 75 PA, we use the median PA of the remaining players in each season as the cutoff for full-time batters.

batters <- bat_dat %>% select(yearID, playerID, lgID, name, age, PA, G, bWAR, pops)
cutoff <- do.call(rbind, mclapply(years, mc.cores = ncores, FUN = function(xx){
    m <- batters %>% filter(yearID == xx) %>% filter(PA >= 75)
    data.frame(thres = median(m$PA), yearID = xx)
  }))
batters <- merge(batters, cutoff, by = 'yearID')
batters <- batters %>% mutate(comp =  bWAR / G, full_time = ifelse(PA >= thres, 'Y', 'N'))
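As a minimal illustration of the cutoff rule, the sketch below applies it to a made-up single season using base R only. The data frame toy and its values are hypothetical.

```r
# Hypothetical one-season data: five batters with their plate appearances
toy <- data.frame(
  yearID = rep(2000, 5),
  PA     = c(10, 80, 300, 500, 650)
)

# Screen out batters with fewer than 75 PA, then take the median of the rest
eligible <- toy$PA[toy$PA >= 75]   # 80, 300, 500, 650
thres <- median(eligible)          # (300 + 500) / 2 = 400

# A batter is full-time when his PA reaches the season cutoff
toy$full_time <- ifelse(toy$PA >= thres, "Y", "N")
```

Here only the batters with 500 and 650 PA are flagged as full-time; the batter with 10 PA does not influence the cutoff at all.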

In 1994 and 1995, hitters and pitchers played fewer games than in a regular season. To deal with extreme statistics arising from small sample sizes, we use a shrinkage method that adjusts the raw statistics toward a global average. It follows the shrinkage of ballpark effect estimates in Michael Schell’s book, Baseball’s All-Time Best Sluggers. The method takes a weighted average of the raw component and the league average:

\[\textrm{adjusted component} = \frac{\textrm{raw component} \times \textrm{total ABs} + 4000 \times \textrm{league average}}{\textrm{total ABs} + 4000}\]

When we apply the shrinkage method to our seasonal data, the constant 4000 in the shrinkage factor \(4000 / (\textrm{total ABs} + 4000)\) is rescaled according to the ratio of 4000 to the total ABs of MLB teams. For example,

\[\textrm{adjusted bWAR per game} = \frac{\textrm{raw bWAR per game} \times \textrm{games} + 7 \times \textrm{league average}}{\textrm{games} + 7}\]

batters_schell <- do.call(rbind, mclapply(years, mc.cores = ncores, FUN = function(xx){
    int<- batters %>% filter(yearID == xx)
    lg_avg <- sum(int$bWAR)/sum(int$G)
    int %>% mutate(comp = (bWAR + lg_avg * 7)/(G + 7))
  }))
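To see what the shrinkage does in practice, here is a small worked sketch with made-up numbers. The helper shrink_rate and the league average of 0.01 bWAR per game are assumptions for illustration only.

```r
# Shrink a per-game rate toward the league average.
# total: season total (e.g. bWAR), n: games, k: shrinkage constant (7 for bWAR per game).
# shrink_rate is a hypothetical helper, not part of source.R.
shrink_rate <- function(total, n, lg_avg, k) {
  (total + lg_avg * k) / (n + k)
}

# A batter with 1.0 bWAR in only 10 games: the raw rate is extreme
raw_rate <- 1.0 / 10
adj_rate <- shrink_rate(total = 1.0, n = 10, lg_avg = 0.01, k = 7)   # 1.07 / 17

# A batter with 6.0 bWAR in 150 games: the rate barely moves
adj_full <- shrink_rate(total = 6.0, n = 150, lg_avg = 0.01, k = 7)  # 6.07 / 157
```

The small-sample rate is pulled strongly toward the league average, while the full-season rate stays close to its raw value of 0.04.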

The following script computes the talent scores for bWAR per game for all batters and all seasons (1871 to 2023).

batters_talent_bWAR <- do.call(rbind, mclapply(years, mc.cores = ncores, function(yy){
  talent_computing_nonpara(
    dataset = batters_schell, component_name = "bWAR", year = yy,
    ystar = thresh_fun(
      component = batters_schell %>% filter(full_time == 'Y', yearID == yy) %>% select(comp),
      component_name = 'bWAR'),
    alpha = 1.16)
})) %>% arrange(-WAR_talent)

We build a common mapping environment, with a given season size, from no-strike seasons rather than using one specific season from 1871 to 2023 as the projected season. To obtain era-adjusted statistics, we project players’ talent into this common mapping environment. Following Michael Schell, we use the National League seasons from 1977 through 1989, excluding the 1981 strike season, as the common mapping environment. The number of teams is constant across these seasons, which avoids the effect of Major League expansion. We then fit an isotonic regression of the components on the ordered talent scores. This model gives the relationship between the components and the talent scores in the common mapping environment.

We set the total number of full-time players throughout all seasons to be equal to the number of components in the common mapping environment. This is because the more components there are in the common mapping environment, the closer \(\widetilde{F}_{Y_i}(t)\) and \(F_{Y_i}(t)\) are. We select this number of components by quantile mapping, with equally spaced quantiles: we take equally spaced quantiles of the talent scores and compute the corresponding components from the isotonic regression model. This completes the construction of the common mapping environment.

no_strike <- batters_talent_bWAR %>% 
    filter(yearID < 1990, yearID >= 1977, yearID != 1981, 
           full_time == 'Y', lgID == 'NL') %>%
    arrange(WAR_talent)
  
t <- isoreg(no_strike$WAR_talent, no_strike$comp)
talent_new <- quantile(no_strike$WAR_talent, probs = (seq(0,250)/250))
comp_new <- as.stepfun(t)(talent_new)
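The same isoreg and as.stepfun pattern can be tried on simulated data to see that the fitted components are monotone in talent. Everything below (the simulated talent and comp vectors and the grid of 21 quantiles) is made up for illustration and is not the report's data.

```r
set.seed(1)

# Simulated talent scores and noisy, roughly increasing components
talent <- sort(rexp(200))
comp <- 0.01 * log1p(talent) + rnorm(200, sd = 0.005)

# Isotonic (monotone non-decreasing) fit of components on talent
fit <- isoreg(talent, comp)

# Equally spaced quantiles of talent, mapped to components through the step function
grid <- quantile(talent, probs = seq(0, 20) / 20)
comp_grid <- as.stepfun(fit)(grid)
```

Because the fit is isotonic, comp_grid never decreases as talent increases, even though the raw simulated components are noisy.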

The figure produced by the code below shows the relationship between bWAR talent and bWAR per game. The black dots are the observations for full-time National League batters from the 1977 through 1989 seasons, excluding the 1981 strike season. The red dots are the points of the common mapping environment defined above. Based on the figure, the common mapping environment accurately captures the relationship between bWAR talent and bWAR per game over these seasons.

mapping_envir <- data.frame(talent_new = talent_new, comp_new = comp_new)
ggplot(data = no_strike, aes(x = WAR_talent, y = comp)) + geom_point() + 
  geom_point(data = mapping_envir, aes(x = talent_new, y = comp_new), color = 'red') +
  labs(x = 'bWAR talent', y = 'bWAR per game') +
  scale_x_continuous(trans='log')

ystar <- thresh_fun(comp_new, component_name = 'bWAR')
pop_new <- round(mean(no_strike$pops))
component_name = 'bWAR'
yy <- sort(comp_new)
n <- length(yy)
ytilde <- rep(0, n + 1)
if (component_name == 'bWAR' | component_name == 'fWAR') {
    ytilde[1] <- yy[1] - (yy[2] - yy[1])
}
if (component_name == 'HR' | component_name == 'BB') {
    # since the minimal HR is greater or equal to 0.
    ytilde[1] <- 0
}
ytilde[n+1] <- yy[n] + ystar
ytilde[2:n] <- unlist(lapply(2:n, function(j){
    (yy[j]+yy[j-1])/2 
}))
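The grid construction above can be wrapped in a small function and checked on a toy vector. The function name build_ytilde, the lower_zero flag, and the value ystar = 0.5 are hypothetical; in the report, ystar comes from thresh_fun in source.R.

```r
# Build the n + 1 grid points around n sorted components:
# an extrapolated (or zero) lower end, midpoints in between, and yy[n] + ystar on top.
build_ytilde <- function(comp, ystar, lower_zero = FALSE) {
  yy <- sort(comp)
  n <- length(yy)
  stopifnot(n >= 2)
  ytilde <- numeric(n + 1)
  # WAR-type statistics extrapolate below the minimum; counts such as HR or BB start at 0
  ytilde[1] <- if (lower_zero) 0 else yy[1] - (yy[2] - yy[1])
  ytilde[2:n] <- (yy[2:n] + yy[1:(n - 1)]) / 2
  ytilde[n + 1] <- yy[n] + ystar
  ytilde
}

war_grid <- build_ytilde(c(2, 3, 5), ystar = 0.5)                     # 1, 2.5, 4, 5.5
hr_grid  <- build_ytilde(c(2, 3, 5), ystar = 0.5, lower_zero = TRUE)  # 0, 2.5, 4, 5.5
```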

We now extend the method to compute hypothetical careers in the common mapping environment and obtain the era-adjusted statistics.

career_kAB_1st <- do.call(rbind, mclapply(1:30000, function(zz){
    int <- career_talent_nonpara(dataset = batters_talent_bWAR, component_name = 'bWAR', 
                         snippet = batters_talent_bWAR[zz,], alpha = 1.16)
    int
  }, mc.cores = ncores))
  
  
career_kAB_2nd <- do.call(rbind, mclapply(30001:60000, function(zz){
    int <- career_talent_nonpara(dataset = batters_talent_bWAR, component_name = 'bWAR', 
                         snippet = batters_talent_bWAR[zz,], alpha = 1.16)
    int
  }, mc.cores = ncores)) 
  
career_kAB_3rd <- do.call(rbind, mclapply(60001:nrow(batters_talent_bWAR), function(zz){
    int <- career_talent_nonpara(dataset = batters_talent_bWAR, component_name = 'bWAR', 
                         snippet = batters_talent_bWAR[zz,], alpha = 1.16)
    int
  }, mc.cores = ncores)) 
  
career_kAB <- rbind(career_kAB_1st, career_kAB_2nd, career_kAB_3rd)

Instead of using the raw G and PA in the data set, we compute mapped G and PA by applying quantile mapping to full-time and non-full-time hitters separately. Quantile mapping assumes that the games played by a pth-percentile player in a given year correspond to the games played by a pth-percentile player in the common mapping environment.

We also specify that the worst bWAR performance for full-time players in each season is -2.
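The floor at -2 and the games it implies can be sketched on two made-up player seasons. The per-game values and game counts below are hypothetical.

```r
min_ref <- -2                 # worst allowed era-adjusted bWAR in a season
adj_comp <- c(-0.02, 0.03)    # hypothetical era-adjusted bWAR per game
mapped_G <- c(150, 150)       # hypothetical mapped games

# Season totals, with anything below the floor raised to the floor
adj_war <- pmax(adj_comp * mapped_G, min_ref)   # -3 becomes -2; 4.5 is unchanged

# Games consistent with the floored total at the same per-game rate
implied_G <- round(adj_war / adj_comp)          # 100 and 150
```

The first player would have reached -3 over 150 games, so his total is capped at -2 and his games are reduced to the 100 at which that total is reached.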

## mapping statistics
  
mapped_quan_b_raw <- do.call(rbind, mclapply(years, function(xx){
    batters_full <- batters %>% filter(yearID == xx) %>% 
      filter(full_time == 'Y') %>% arrange(-G)
    batters_less <- batters %>% filter(yearID == xx) %>% 
      filter(full_time == 'N') %>% arrange(-G)
    
    n1 <- nrow(batters_full)
    n2 <- nrow(batters_less)
    
    mapped_G_full <- c()
    mapped_G_less <- c()
    for (yy in c(1977:1980, 1982:1989)) {
      batters_ref_full <- batters %>% 
        filter(yearID == yy, full_time == 'Y') %>% arrange(-G)
      batters_ref_less <- batters %>% 
        filter(yearID == yy, full_time == 'N') %>% arrange(-G)
      n1r <- nrow(batters_ref_full)
      n2r <- nrow(batters_ref_less)
      
      mapped_G_full <- cbind(mapped_G_full, approx(x = seq((n1r-1),0)/(n1r-1), 
                                                   y = batters_ref_full$G, 
                                                   xout = seq((n1-1),0)/(n1-1))$y)
      mapped_G_less <- cbind(mapped_G_less, approx(x = seq((n2r-1),0)/(n2r-1), 
                                                   y = batters_ref_less$G, 
                                                   xout = seq((n2-1),0)/(n2-1))$y)
    }
    
    batters_full$mapped_G <- rowMeans(mapped_G_full)
    batters_less$mapped_G <- rowMeans(mapped_G_less)
    
    
    batters_full <- batters_full %>% arrange(-PA)
    batters_less <- batters_less %>% arrange(-PA)
    mapped_PA_full <- c()
    mapped_PA_less <- c()
    for (yy in c(1977:1980, 1982:1989)) {
      batters_ref_full <- batters %>% 
        filter(yearID == yy, full_time == 'Y') %>% arrange(-PA)
      batters_ref_less <- batters %>% 
        filter(yearID == yy, full_time == 'N') %>% arrange(-PA)
      n1r <- nrow(batters_ref_full)
      n2r <- nrow(batters_ref_less)
      
      mapped_PA_full <- cbind(mapped_PA_full, approx(x = seq((n1r-1),0)/(n1r-1), 
                                                     y = batters_ref_full$PA, 
                                                     xout = seq((n1-1),0)/(n1-1))$y)
      mapped_PA_less <- cbind(mapped_PA_less, approx(x = seq((n2r-1),0)/(n2r-1), 
                                                     y = batters_ref_less$PA, 
                                                     xout = seq((n2-1),0)/(n2-1))$y)
    }
    
    batters_full$mapped_PA <- rowMeans(mapped_PA_full)
    batters_less$mapped_PA <- rowMeans(mapped_PA_less)
    
    m <- rbind(batters_full, batters_less)
    data.frame(playerID = m$playerID, yearID = m$yearID,
               mapped_G_raw = round(m$mapped_G), mapped_PA_raw = round(m$mapped_PA))
    
  }, mc.cores = ncores))
   
mapped_batters_1 <- merge(career_kAB, mapped_quan_b_raw, 
                            by = c('playerID', 'yearID'))
  
min_refbWAR <- -2
  
mapped_batters_bWAR <- mapped_batters_1 %>% 
    mutate(adj_bWAR = adj_comp * mapped_G_raw) %>% 
    mutate(adj_bWAR = ifelse(adj_bWAR < min_refbWAR, min_refbWAR, adj_bWAR)) %>%
    mutate(mapped_G_bWAR = round(adj_bWAR / adj_comp))
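The quantile-mapping step built on approx can be isolated in a few lines. The sketch below maps a season with three players onto a reference season with four; the helper map_by_quantile and the game counts are hypothetical, and both seasons need at least two players for the percentile grid to be defined.

```r
# Map n_target ranked players onto a reference season by matching percentiles.
# ref_sorted_desc: reference values (e.g. games) sorted from largest to smallest.
map_by_quantile <- function(ref_sorted_desc, n_target) {
  n_ref <- length(ref_sorted_desc)
  stopifnot(n_ref >= 2, n_target >= 2)
  approx(x = seq(n_ref - 1, 0) / (n_ref - 1),
         y = ref_sorted_desc,
         xout = seq(n_target - 1, 0) / (n_target - 1))$y
}

ref_G <- c(160, 150, 100, 50)                   # reference season, sorted by games
mapped <- map_by_quantile(ref_G, n_target = 3)  # top, median, and bottom player
```

The top and bottom players receive the reference extremes (160 and 50), and the median player receives the interpolated value 125, halfway between the reference players at the 1/3 and 2/3 percentiles.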

Then we apply the same techniques for fWAR.

We also apply our Full House Model to BB and HR. Before doing so, we obtain lower bounds on the walk rate and home run rate. We isolate players with batting records from the common mapping environment and restrict attention to players with at least 10 years of batting records. We then compute the 0.03 quantile of the walk rate and home run rate for these players over seasons in which they had at least 400 PA, and use these quantiles as the minimum allowable rates in the common mapping environment.

IDs7789 = Batting %>% 
  filter(yearID >= 1977, yearID <= 1989, yearID != 1981) %>% 
  pull(playerID)

IDs10 = Batting %>% 
  filter(playerID %in% IDs7789) %>% 
  group_by(playerID) %>% 
  summarise(n = n()) %>% 
  filter(n >= 10) %>% 
  pull(playerID)

lowerb <- Batting %>% 
  filter(playerID %in% IDs10) %>% 
  filter(yearID >= 1977, yearID <= 1989, yearID != 1981) %>% 
  mutate(PA = AB + BB + HBP + SH + SF) %>% 
  filter(PA >= 400) %>% 
  mutate(BBrate = BB/PA, HRrate = HR/AB) %>% 
  select(playerID, yearID, BBrate, HRrate) %>% 
  summarise(Q03BB = quantile(BBrate, probs = 0.03), 
            Q03HR = quantile(HRrate, probs = 0.03))
lowerb
##        Q03BB       Q03HR
## 1 0.03512232 0.001848296

Stephen Jay Gould suggests that BA in every season follows a normal distribution, and we perform the Shapiro-Wilk normality test to check this claim. At the 0.05 level, 19 of the 153 seasons from 1871 to 2023 fail the normality test. These seasons are 1876, 1895, 1896, 1904, 1908, 1912, 1916, 1922, 1924, 1941, 1946, 1959, 1961, 1967, 1972, 1987, 1994, 1997, and 2023.

batters <- bat_dat
  
  batters <- merge(batters, cutoff, by = 'yearID')
  
  batters <- batters %>% select(playerID, yearID, lgID, name, age, lgID, AB, 
                                PA, obs_hits, H, thres, bWAR, pops) %>%
    mutate(comp = ifelse(AB != 0, H / AB, 0), full_time = ifelse(PA >= thres, 'Y', 'N'))
  
  batters_schell <- do.call(rbind, mclapply(years, mc.cores = ncores, FUN = function(xx){
    int<- batters %>% filter(yearID == xx)
    lg_avg <- sum(int$H)/sum(int$AB)
    int %>% mutate(comp = (H + lg_avg * 25)/(AB + 25))
  }))
  
  normality <- do.call(rbind, mclapply(years, mc.cores = ncores, FUN = function(xx){
    m <- batters_schell %>% filter(yearID == xx, full_time == 'Y') %>% pull(comp) %>% shapiro.test()
    data.frame(yearID = xx, p_value = m$p.value)
  }))
  
  years[which(normality$p_value <= 0.05)]
##  [1] 1876 1895 1896 1904 1908 1912 1916 1922 1924 1941 1946 1959 1961 1967 1972
## [16] 1987 1994 1997 2023
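For readers unfamiliar with shapiro.test, the sketch below runs it on two simulated samples, one normal and one strongly skewed. The sample size, mean, and standard deviation are made up and only loosely resemble a season of batting averages.

```r
set.seed(42)

# A sample that really is normal, and one that is heavily right-skewed
normal_sample <- rnorm(150, mean = 0.270, sd = 0.025)
skewed_sample <- rexp(150, rate = 1 / 0.270)

# Small p-values are evidence against normality
p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

# Same decision rule as in the report: flag the sample when p <= 0.05
fails_skewed <- p_skewed <= 0.05
```

Note that with 153 seasons tested at the 0.05 level, a handful of rejections would be expected even if every season were exactly normal.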

We now use both the parametric and nonparametric distributions to measure BA.

We eliminate players whose era-adjusted bWAR or fWAR is below the replacement level for more than half of their career seasons. We also set three rules to remove some players’ poor early-career and late-career seasons. The three rules are

  • In at least 2 consecutive seasons, the era-adjusted bWAR is below -1.5.
  • In at least 2 consecutive seasons, the era-adjusted fWAR is below -1.5.
  • Over at least 2 consecutive seasons, at most one of the era-adjusted bWAR and era-adjusted fWAR values exceeds 0.2.

The value 0.2 is the average bWAR or fWAR of the players who left MLB between the 1977 and 1989 seasons, excluding the 1981 season.
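The counting logic behind the tail rules can be sketched on one made-up career. The helper count_bad_tail and the bWAR values are hypothetical; the thresholds 2:6 mirror the checks over the last 2 through 6 seasons used for the -1.5 rules.

```r
# Count how many of the "last k seasons" checks (k = 2, 3, ...) are triggered.
# bad: 0/1 flags per season in career order; thresholds[i] applies to the last i + 1 seasons.
count_bad_tail <- function(bad, thresholds) {
  sum(sapply(seq_along(thresholds), function(i) {
    sum(tail(bad, i + 1)) >= thresholds[i]
  }))
}

adj_bWAR <- c(3.1, 2.0, 0.4, -1.6, -1.8, -1.7)   # hypothetical era-adjusted career
bad <- as.integer(adj_bWAR <= -1.5)              # 0 0 0 1 1 1

# The last-2 and last-3 checks fire, the longer windows do not
n_drop <- count_bad_tail(bad, thresholds = 2:6)
kept <- adj_bWAR[seq_len(length(bad) - n_drop)]
```

In this example two of the three poor final seasons are removed, so the trimmed career ends with the first season at or below -1.5.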

After obtaining the era-adjusted statistics, we find that some players’ statistics change dramatically in the tails of their careers, which is unrealistic. To address this, we apply smoothing methods, such as local polynomial regression and natural cubic splines, to dampen these variations. The natural cubic spline has the smallest bias and is our preferred option among these methods.

We also notice that smoothing can weaken a player’s prime or extreme seasons. We therefore average the smoothed and unsmoothed era-adjusted statistics, which preserves a player’s prime seasons while dampening dramatic changes in the tails of the career.
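A minimal sketch of this averaging step, on one made-up career, is given below. The career values are hypothetical, and df = 3 is chosen only because the toy career is short; it is not the report's setting.

```r
library(splines)

# Hypothetical era-adjusted bWAR over a 12-season career with one extreme peak
career <- data.frame(
  year = 2001:2012,
  adj_bWAR = c(0.5, 2.1, 3.0, 4.2, 7.9, 4.5, 4.0, 3.2, 2.5, 0.2, 1.8, -0.9)
)

# Natural cubic spline fit of the era-adjusted statistic on year
fit <- lm(adj_bWAR ~ ns(year, df = 3), data = career)
career$smooth <- as.numeric(predict(fit, newdata = career))

# Average of the raw and smoothed values
career$blend <- (career$adj_bWAR + career$smooth) / 2
```

Each blended value lies halfway between the raw and smoothed values, so the peak season is reduced by only half as much as it would be under smoothing alone.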

AVG_part <- mapped_batters_AVG_nonpara %>%
    mutate(adj_AVG = round(adj_AVG, 3)) %>% 
    select(yearID, playerID, name, adj_AVG) 
  HR_part <- mapped_batters_HR %>% 
    mutate(mapped_PA = round(mapped_PA)) %>% 
    mutate(adj_HR = round(adj_HR)) %>%
    mutate(adj_AB = round(adj_AB)) %>%
    select(yearID, playerID, adj_HR, adj_AB, HBP, SF, mapped_PA) 
  bWAR_part <- mapped_batters_bWAR %>%
    mutate(adj_bWAR = round(adj_bWAR, 2)) %>%
    select(yearID, playerID, adj_bWAR, mapped_G_bWAR) 
  fWAR_part <- mapped_batters_fWAR %>%
    mutate(adj_fWAR = round(adj_fWAR, 2)) %>%
    select(yearID, playerID, age, adj_fWAR, mapped_G_fWAR) 
  BB_part <- mapped_batters_HR %>% 
    mutate(adj_BB = round(adj_BB)) %>%
    select(yearID, playerID, BB, adj_BB)
  master_batters <- merge(BB_part, merge(AVG_part, 
                                         merge(HR_part, merge(bWAR_part, fWAR_part, 
                                                              by = c('yearID', 'playerID')), 
                                               by = c('yearID', 'playerID')), 
                                         by = c('yearID', 'playerID')), 
                          by = c('yearID', 'playerID'))
  
  master_batters <- master_batters %>% 
    mutate(adj_OBP = round((adj_AVG * adj_AB + adj_BB + HBP) / (adj_AB + adj_BB + HBP + SF), 3))
  master_batters$adj_OBP[is.na(master_batters$adj_OBP)] <- 0
  
  master_batters_nonpara <- master_batters %>% mutate(adj_BB = ifelse(adj_BB < 0, 0, adj_BB)) %>% 
    mutate(adj_AB = ifelse(mapped_PA < adj_AB, mapped_PA, adj_AB))
  master_batters_nonpara$mapped_G <- apply(master_batters_nonpara[,c(13,16)], 1, min)
  
  ## Batters
  batters <- master_batters_nonpara

  ## extract and remove bad players 
  foo <- batters %>% 
    arrange(desc(adj_AVG)) %>% 
    filter(adj_AB >= 300) %>% 
    dplyr::select(name, playerID, yearID, adj_AB, adj_AVG, adj_OBP, adj_HR, adj_fWAR, adj_bWAR) %>% 
    mutate(adj_HR_AB = round(adj_HR/adj_AB,4))
  bar <- split(foo, as.factor(foo$playerID))
  baz <- do.call(rbind, lapply(bar, function(m){
    m[which.max(m$adj_fWAR), ]
  }))
  baz <- baz %>% arrange(adj_fWAR)
  bad_players_fWAR <- baz %>% filter(adj_fWAR < 0) %>% pull(playerID)
  
  baz <- do.call(rbind, lapply(bar, function(m){
    m[which.max(m$adj_bWAR), ]
  }))
  baz <- baz %>% arrange(adj_bWAR)
  bad_players_bWAR <- baz %>% filter(adj_bWAR < 0) %>% pull(playerID)
  bad_players <- union(bad_players_bWAR, bad_players_fWAR)
  batters <- batters[!batters$playerID %in% bad_players, ]
  
  ###### investigate anomalies ######
  
  ## more on base events than PAs
  
  # check average for minimal at bats and correct issues
  batters[batters$adj_AVG > 0 & batters$adj_AB == 0, ]$adj_AVG <- 0 
  batters <- batters %>% mutate(adj_hits = round(adj_AVG * adj_AB))
  batters[batters$adj_AB > 0, ]$adj_AVG <- 
    round(batters[batters$adj_AB > 0, ]$adj_hits / batters[batters$adj_AB > 0, ]$adj_AB, 3)
  
  ## build adjusted data set
  batters_adjusted <- batters %>% 
    dplyr::select(name, playerID, age, yearID, mapped_PA, adj_AB, adj_hits, adj_HR, adj_BB, 
                  adj_AVG, adj_OBP, HBP, SF, adj_bWAR, adj_fWAR)
  colnames(batters_adjusted) <- c("name", "playerID", "age", "year", "mapped_PA", "adj_AB", "adj_H", 
                                  "adj_HR", "adj_BB", "adj_AVG", "adj_OBP", "HBP", "SF", "adj_bWAR", "adj_fWAR")
  batters_adjusted$playerID <- droplevels(as.factor(batters_adjusted$playerID))

  
  ## trim out bad players
  # first round
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    ifelse(m$adj_bWAR <= 0, 1, 0) + ifelse(m$adj_fWAR <= 0, 1, 0)
  })
  checker <- data.frame(pid = levels(batters_adjusted$playerID), 
                        m_bad = unlist(lapply(bar, mean)), 
                        len = unlist(lapply(bar, length)))
  batters_adjusted <- batters_adjusted %>% 
    filter(!batters_adjusted$playerID %in% rownames(checker)[checker$m_bad == 2])
  batters_adjusted$playerID <- droplevels(batters_adjusted$playerID)
  
  # second round
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    ifelse(m$adj_bWAR <= 0, 1, 0) + ifelse(m$adj_fWAR <= 0, 1, 0)
  })
  checker <- data.frame(pid = levels(batters_adjusted$playerID), 
                        m_bad = unlist(lapply(bar, mean)), 
                        len = unlist(lapply(bar, length)))
  batters_adjusted <- batters_adjusted %>% 
    filter(!batters_adjusted$playerID %in% rownames(checker)[checker$m_bad >= 1 & checker$len <= 2])
  batters_adjusted$playerID <- droplevels(batters_adjusted$playerID)
  
  # third round
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    min(ifelse(m$adj_bWAR <= 0, 1, 0) + ifelse(m$adj_fWAR <= 0, 1, 0))
  })
  checker <- data.frame(pid = levels(batters_adjusted$playerID), 
                        m_bad = unlist(lapply(bar, mean)), 
                        len = unlist(lapply(bar, length)))
  batters_adjusted <- batters_adjusted %>% 
    filter(!batters_adjusted$playerID %in% rownames(checker)[checker$m_bad >= 1])
  batters_adjusted$playerID <- droplevels(batters_adjusted$playerID)
  
  # fourth round
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    ifelse(m$adj_bWAR == -2, 1, 0) + ifelse(m$adj_fWAR == -2, 1, 0)
  })
  checker <- data.frame(pid = levels(batters_adjusted$playerID), 
                        m_bad = unlist(lapply(bar, mean)), 
                        len = unlist(lapply(bar, length)))
  batters_adjusted <- batters_adjusted %>% 
    filter(!batters_adjusted$playerID %in% rownames(checker)[checker$m_bad >= 1])
  batters_adjusted$playerID <- droplevels(batters_adjusted$playerID)
  
  
  ## remove tails 
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    bad <- ifelse(m$adj_bWAR <= 0.2, 1, 0) + ifelse(m$adj_fWAR <= 0.2, 1, 0)
    bad_tail <- sum(c(ifelse(sum(tail(bad, 2)) >= 3,1,0),
                      ifelse(sum(tail(bad, 3)) >= 5,1,0),
                      ifelse(sum(tail(bad, 4)) >= 7,1,0),
                      ifelse(sum(tail(bad, 5)) >= 9,1,0),
                      ifelse(sum(tail(bad, 6)) >= 11,1,0)))
    1:(length(bad)-bad_tail)
  })
  batters_adjusted_1 <- do.call(rbind, lapply(1:length(bar), function(j){
    foo[[j]][bar[[j]], ]
  })) %>% arrange(year)
  
  foo <- split(batters_adjusted_1, f = batters_adjusted_1$playerID)
  bar <- lapply(foo, function(m){
    bad <- ifelse(m$adj_fWAR <= -1.5, 1, 0) 
    bad_tail <- sum(c(ifelse(sum(tail(bad, 2)) >= 2,1,0),
                      ifelse(sum(tail(bad, 3)) >= 3,1,0),
                      ifelse(sum(tail(bad, 4)) >= 4,1,0),
                      ifelse(sum(tail(bad, 5)) >= 5,1,0),
                      ifelse(sum(tail(bad, 6)) >= 6,1,0)))
    1:(length(bad)-bad_tail)
  })
  batters_adjusted_2 <- do.call(rbind, lapply(1:length(bar), function(j){
    foo[[j]][bar[[j]], ]
  })) %>% arrange(year)
  
  foo <- split(batters_adjusted_2, f = batters_adjusted_2$playerID)
  bar <- lapply(foo, function(m){
    bad <- ifelse(m$adj_bWAR <= -1.5, 1, 0) 
    bad_tail <- sum(c(ifelse(sum(tail(bad, 2)) >= 2,1,0),
                      ifelse(sum(tail(bad, 3)) >= 3,1,0),
                      ifelse(sum(tail(bad, 4)) >= 4,1,0),
                      ifelse(sum(tail(bad, 5)) >= 5,1,0),
                      ifelse(sum(tail(bad, 6)) >= 6,1,0)))
    1:(length(bad)-bad_tail)
  })
  batters_adjusted_3 <- do.call(rbind, lapply(1:length(bar), function(j){
    foo[[j]][bar[[j]], ]
  })) %>% arrange(year)
  
  ## remove starts
  foo <- split(batters_adjusted_3, f = batters_adjusted_3$playerID)
  bar <- lapply(foo, function(m){
    bad <- ifelse(m$adj_bWAR <= 0, 1, 0) + ifelse(m$adj_fWAR <= 0, 1, 0)
    bad_head <- sum(c(ifelse(sum(head(bad, 1)) == 2,1,0),
                      ifelse(sum(head(bad, 2)) >= 3,1,0),
                      ifelse(sum(head(bad, 3)) >= 5,1,0),
                      ifelse(sum(head(bad, 4)) >= 7,1,0),
                      ifelse(sum(head(bad, 5)) >= 9,1,0),
                      ifelse(sum(head(bad, 6)) >= 11,1,0)))
    if (bad_head < length(bad)) {
      (bad_head + 1):length(bad)
    }
  })
  batters_adjusted_4 <- do.call(rbind, lapply(1:length(bar), function(j){
    foo[[j]][bar[[j]], ]
  })) %>% arrange(year)
  
  batters_adjusted_4$playerID <- droplevels(batters_adjusted_4$playerID)
  
  # taper down average WAR for players with small PAs
  batters_adjusted_4[batters_adjusted_4$mapped_PA <= 20, ]$adj_fWAR <- 
    round(batters_adjusted_4[batters_adjusted_4$mapped_PA <= 20, ]$adj_fWAR/9,2)
  batters_adjusted_4[batters_adjusted_4$mapped_PA <= 20, ]$adj_bWAR <- 
    round(batters_adjusted_4[batters_adjusted_4$mapped_PA <= 20, ]$adj_bWAR/9,2)    
  
  batters_adjusted <- do.call(rbind, mclapply(
  split(batters_adjusted_4, f = droplevels(as.factor(batters_adjusted_4$playerID))), 
  mc.cores = ncores, FUN = function(xx){
    ## natural cubic spline
    #ns_AVG = lm(adj_AVG ~ ns(year, df=6), data=xx)
    #nn_AVG <- predict(ns_AVG, data.frame("year"= xx$year))
    #ns_HR = lm(adj_HR ~ ns(year, df=6), data=xx)
    #nn_HR <- predict(ns_HR, data.frame("year"= xx$year))
    #ns_BB = lm(adj_BB ~ ns(year, df=6), data=xx)
    #nn_BB <- predict(ns_BB, data.frame("year"= xx$year))
    #ns_bWAR = lm(adj_bWAR ~ ns(year, df=6), data=xx)
    #nn_bWAR <- predict(ns_bWAR, data.frame("year"= xx$year))
    #ns_fWAR = lm(adj_fWAR ~ ns(year, df=6), data=xx)
    #nn_fWAR <- predict(ns_fWAR, data.frame("year"= xx$year))
    
    xx %>% mutate(AVG = round(adj_AVG, 3)) %>% 
      mutate(HR = round(adj_HR)) %>%
      mutate(BB = round(adj_BB)) %>%
      mutate(ebWAR = round(adj_bWAR, 2)) %>%
      mutate(efWAR = round(adj_fWAR, 2))
  })) 
  
  bat_season <- batters_adjusted %>% 
    mutate(PA = adj_AB + BB + HBP + SF)
  bat_season[bat_season$HR >= bat_season$adj_AB, ]$HR <- 
    bat_season[bat_season$HR >= bat_season$adj_AB, ]$adj_AB

 bat_season <- bat_season %>% 
   dplyr::select(-c("adj_H", "adj_HR", "adj_BB", "adj_AVG", "adj_OBP", "adj_bWAR", "adj_fWAR", 'mapped_PA'))
 bat_season <- bat_season %>% mutate(OBP = round((AVG * adj_AB + BB + HBP)/(adj_AB + BB + HBP + SF), 3 )) %>%
   mutate(AVG = ifelse(AVG > 0, AVG, 0)) %>%
   mutate(HR = ifelse(HR > 0, HR, 0)) %>%
   mutate(BB = ifelse(BB > 0, BB, 0)) %>%
   mutate(OBP = ifelse(OBP > 0, OBP, 0))
 
 colnames(bat_season)[5] <- 'AB'
 colnames(bat_season)[8] <- 'BA'
 bat_season_nonpara <- bat_season %>% mutate(H = ceiling(AB * BA)) %>%
   mutate(BA = round(H / AB, 3)) %>%
   mutate(BA = ifelse(AB == 0, 0, BA)) %>%
   mutate(OBP = round((H+BB+HBP)/(AB+BB+HBP+SF), 3)) %>%
   mutate(OBP = ifelse(AB+BB+HBP+SF == 0, 0, OBP))
 
  bat_career_nonpara <- bat_season_nonpara %>% group_by(playerID) %>% 
    summarise(name = unique(name), 
              playerID = unique(playerID), 
              PA = sum(round(PA)), 
              AB = sum(AB), 
              H = sum(H), 
              HR = sum(round(HR)), 
              BB = sum(round(BB)), 
              BA = round(H/AB, 3), 
              HBP = sum(HBP), 
              SF = sum(SF), 
              OBP = round((H + BB + HBP)/(AB + BB + HBP + SF), 3), 
              ebWAR = sum(ebWAR), 
              efWAR = sum(efWAR)) %>% ungroup() %>% 
    arrange(desc(ebWAR))
  
  bat_career_nonpara <- bat_career_nonpara %>% mutate(BA = ifelse(AB == 0, 0, BA))
  bat_career_nonpara <- bat_career_nonpara %>% mutate(OBP = ifelse(AB + BB + HBP + SF == 0, 0, OBP))

  ## parametric version: rebuild the batting tables using the parametric BA adjustment
AVG_part <- mapped_batters_AVG_para %>%
    mutate(adj_AVG = round(adj_AVG, 3)) %>% 
    select(yearID, playerID, name, adj_AVG) 
  HR_part <- mapped_batters_HR %>% 
    mutate(mapped_PA = round(mapped_PA)) %>% 
    mutate(adj_HR = round(adj_HR)) %>%
    mutate(adj_AB = round(adj_AB)) %>%
    select(yearID, playerID, adj_HR, adj_AB, HBP, SF, mapped_PA) 
  bWAR_part <- mapped_batters_bWAR %>%
    mutate(adj_bWAR = round(adj_bWAR, 2)) %>%
    select(yearID, playerID, adj_bWAR, mapped_G_bWAR) 
  fWAR_part <- mapped_batters_fWAR %>%
    mutate(adj_fWAR = round(adj_fWAR, 2)) %>%
    select(yearID, playerID, age, adj_fWAR, mapped_G_fWAR) 
  BB_part <- mapped_batters_HR %>% 
    mutate(adj_BB = round(adj_BB)) %>%
    select(yearID, playerID, BB, adj_BB)
  master_batters <- merge(BB_part, merge(AVG_part, 
                                         merge(HR_part, merge(bWAR_part, fWAR_part, 
                                                              by = c('yearID', 'playerID')), 
                                               by = c('yearID', 'playerID')), 
                                         by = c('yearID', 'playerID')), 
                          by = c('yearID', 'playerID'))
  
  master_batters <- master_batters %>% 
    mutate(adj_OBP = round((adj_AVG * adj_AB + adj_BB + HBP) / (adj_AB + adj_BB + HBP + SF), 3))
  master_batters$adj_OBP[is.na(master_batters$adj_OBP)] <- 0
  
  master_batters_para <- master_batters %>% mutate(adj_BB = ifelse(adj_BB < 0, 0, adj_BB)) %>% 
    mutate(adj_AB = ifelse(mapped_PA < adj_AB, mapped_PA, adj_AB))
  master_batters_para$mapped_G <- apply(master_batters_para[,c(13,16)], 1, min)
  
  ## Batters
  batters <- master_batters_para

  ## extract and remove bad players 
  foo <- batters %>% 
    arrange(desc(adj_AVG)) %>% 
    filter(adj_AB >= 300) %>% 
    dplyr::select(name, playerID, yearID, adj_AB, adj_AVG, adj_OBP, adj_HR, adj_fWAR, adj_bWAR) %>% 
    mutate(adj_HR_AB = round(adj_HR/adj_AB,4))
  bar <- split(foo, as.factor(foo$playerID))
  baz <- do.call(rbind, lapply(bar, function(m){
    m[which.max(m$adj_fWAR), ]
  }))
  baz <- baz %>% arrange(adj_fWAR)
  bad_players_fWAR <- baz %>% filter(adj_fWAR < 0) %>% pull(playerID)
  
  baz <- do.call(rbind, lapply(bar, function(m){
    m[which.max(m$adj_bWAR), ]
  }))
  baz <- baz %>% arrange(adj_bWAR)
  bad_players_bWAR <- baz %>% filter(adj_bWAR < 0) %>% pull(playerID)
  bad_players <- union(bad_players_bWAR, bad_players_fWAR)
  batters <- batters[!batters$playerID %in% bad_players, ]
  
  ###### investigate anomalies ######
  
  ## more on base events than PAs
  
  # check average for minimal at bats and correct issues
  batters[batters$adj_AVG > 0 & batters$adj_AB == 0, ]$adj_AVG <- 0 
  batters <- batters %>% mutate(adj_hits = round(adj_AVG * adj_AB))
  batters[batters$adj_AB > 0, ]$adj_AVG <- 
    round(batters[batters$adj_AB > 0, ]$adj_hits / batters[batters$adj_AB > 0, ]$adj_AB, 3)
  
  ## build adjusted data set
  batters_adjusted <- batters %>% 
    dplyr::select(name, playerID, age, yearID, mapped_PA, adj_AB, adj_hits, adj_HR, adj_BB, 
                  adj_AVG, adj_OBP, HBP, SF, adj_bWAR, adj_fWAR)
  colnames(batters_adjusted) <- c("name", "playerID", "age", "year", "mapped_PA", "adj_AB", "adj_H", 
                                  "adj_HR", "adj_BB", "adj_AVG", "adj_OBP", "HBP", "SF", "adj_bWAR", "adj_fWAR")
  batters_adjusted$playerID <- droplevels(as.factor(batters_adjusted$playerID))

  
  ## trim out bad players
  # first round
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    ifelse(m$adj_bWAR <= 0, 1, 0) + ifelse(m$adj_fWAR <= 0, 1, 0)
  })
  checker <- data.frame(pid = levels(batters_adjusted$playerID), 
                        m_bad = unlist(lapply(bar, mean)), 
                        len = unlist(lapply(bar, length)))
  batters_adjusted <- batters_adjusted %>% 
    filter(!batters_adjusted$playerID %in% rownames(checker)[checker$m_bad == 2])
  batters_adjusted$playerID <- droplevels(batters_adjusted$playerID)
  
  # second round
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    ifelse(m$adj_bWAR <= 0, 1, 0) + ifelse(m$adj_fWAR <= 0, 1, 0)
  })
  checker <- data.frame(pid = levels(batters_adjusted$playerID), 
                        m_bad = unlist(lapply(bar, mean)), 
                        len = unlist(lapply(bar, length)))
  batters_adjusted <- batters_adjusted %>% 
    filter(!batters_adjusted$playerID %in% rownames(checker)[checker$m_bad >= 1 & checker$len <= 2])
  batters_adjusted$playerID <- droplevels(batters_adjusted$playerID)
  
  # third round
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    min(ifelse(m$adj_bWAR <= 0, 1, 0) + ifelse(m$adj_fWAR <= 0, 1, 0))
  })
  checker <- data.frame(pid = levels(batters_adjusted$playerID), 
                        m_bad = unlist(lapply(bar, mean)), 
                        len = unlist(lapply(bar, length)))
  batters_adjusted <- batters_adjusted %>% 
    filter(!batters_adjusted$playerID %in% rownames(checker)[checker$m_bad >= 1])
  batters_adjusted$playerID <- droplevels(batters_adjusted$playerID)
  
  # fourth round
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    ifelse(m$adj_bWAR == -2, 1, 0) + ifelse(m$adj_fWAR == -2, 1, 0)
  })
  checker <- data.frame(pid = levels(batters_adjusted$playerID), 
                        m_bad = unlist(lapply(bar, mean)), 
                        len = unlist(lapply(bar, length)))
  batters_adjusted <- batters_adjusted %>% 
    filter(!batters_adjusted$playerID %in% rownames(checker)[checker$m_bad >= 1])
  batters_adjusted$playerID <- droplevels(batters_adjusted$playerID)
  
  
  ## remove tails 
  foo <- split(batters_adjusted, f = batters_adjusted$playerID)
  bar <- lapply(foo, function(m){
    bad <- ifelse(m$adj_bWAR <= 0.2, 1, 0) + ifelse(m$adj_fWAR <= 0.2, 1, 0)
    bad_tail <- sum(c(ifelse(sum(tail(bad, 2)) >= 3,1,0),
                      ifelse(sum(tail(bad, 3)) >= 5,1,0),
                      ifelse(sum(tail(bad, 4)) >= 7,1,0),
                      ifelse(sum(tail(bad, 5)) >= 9,1,0),
                      ifelse(sum(tail(bad, 6)) >= 11,1,0)))
    1:(length(bad)-bad_tail)
  })
  batters_adjusted_1 <- do.call(rbind, lapply(1:length(bar), function(j){
    foo[[j]][bar[[j]], ]
  })) %>% arrange(year)
  
  foo <- split(batters_adjusted_1, f = batters_adjusted_1$playerID)
  bar <- lapply(foo, function(m){
    bad <- ifelse(m$adj_fWAR <= -1.5, 1, 0) 
    bad_tail <- sum(c(ifelse(sum(tail(bad, 2)) >= 2,1,0),
                      ifelse(sum(tail(bad, 3)) >= 3,1,0),
                      ifelse(sum(tail(bad, 4)) >= 4,1,0),
                      ifelse(sum(tail(bad, 5)) >= 5,1,0),
                      ifelse(sum(tail(bad, 6)) >= 6,1,0)))
    1:(length(bad)-bad_tail)
  })
  batters_adjusted_2 <- do.call(rbind, lapply(1:length(bar), function(j){
    foo[[j]][bar[[j]], ]
  })) %>% arrange(year)
  
  foo <- split(batters_adjusted_2, f = batters_adjusted_2$playerID)
  bar <- lapply(foo, function(m){
    bad <- ifelse(m$adj_bWAR <= -1.5, 1, 0) 
    bad_tail <- sum(c(ifelse(sum(tail(bad, 2)) >= 2,1,0),
                      ifelse(sum(tail(bad, 3)) >= 3,1,0),
                      ifelse(sum(tail(bad, 4)) >= 4,1,0),
                      ifelse(sum(tail(bad, 5)) >= 5,1,0),
                      ifelse(sum(tail(bad, 6)) >= 6,1,0)))
    1:(length(bad)-bad_tail)
  })
  batters_adjusted_3 <- do.call(rbind, lapply(1:length(bar), function(j){
    foo[[j]][bar[[j]], ]
  })) %>% arrange(year)
  
  ## remove starts
  foo <- split(batters_adjusted_3, f = batters_adjusted_3$playerID)
  bar <- lapply(foo, function(m){
    bad <- ifelse(m$adj_bWAR <= 0, 1, 0) + ifelse(m$adj_fWAR <= 0, 1, 0)
    bad_head <- sum(c(ifelse(sum(head(bad, 1)) == 2,1,0),
                      ifelse(sum(head(bad, 2)) >= 3,1,0),
                      ifelse(sum(head(bad, 3)) >= 5,1,0),
                      ifelse(sum(head(bad, 4)) >= 7,1,0),
                      ifelse(sum(head(bad, 5)) >= 9,1,0),
                      ifelse(sum(head(bad, 6)) >= 11,1,0)))
    if (bad_head < length(bad)) {
      (bad_head + 1):length(bad)
    }
  })
  batters_adjusted_4 <- do.call(rbind, lapply(1:length(bar), function(j){
    foo[[j]][bar[[j]], ]
  })) %>% arrange(year)
  
  batters_adjusted_4$playerID <- droplevels(batters_adjusted_4$playerID)
  
  # taper down average WAR for players with small PAs
  batters_adjusted_4[batters_adjusted_4$mapped_PA <= 20, ]$adj_fWAR <- 
    round(batters_adjusted_4[batters_adjusted_4$mapped_PA <= 20, ]$adj_fWAR/9,2)
  batters_adjusted_4[batters_adjusted_4$mapped_PA <= 20, ]$adj_bWAR <- 
    round(batters_adjusted_4[batters_adjusted_4$mapped_PA <= 20, ]$adj_bWAR/9,2)    
  
  batters_adjusted <- do.call(rbind, mclapply(
  split(batters_adjusted_4, f = droplevels(as.factor(batters_adjusted_4$playerID))), 
  mc.cores = ncores, FUN = function(xx){
    ## natural cubic spline
    #ns_AVG = lm(adj_AVG ~ ns(year, df=6), data=xx)
    #nn_AVG <- predict(ns_AVG, data.frame("year"= xx$year))
    #ns_HR = lm(adj_HR ~ ns(year, df=6), data=xx)
    #nn_HR <- predict(ns_HR, data.frame("year"= xx$year))
    #ns_BB = lm(adj_BB ~ ns(year, df=6), data=xx)
    #nn_BB <- predict(ns_BB, data.frame("year"= xx$year))
    #ns_bWAR = lm(adj_bWAR ~ ns(year, df=6), data=xx)
    #nn_bWAR <- predict(ns_bWAR, data.frame("year"= xx$year))
    #ns_fWAR = lm(adj_fWAR ~ ns(year, df=6), data=xx)
    #nn_fWAR <- predict(ns_fWAR, data.frame("year"= xx$year))
    
    xx %>% mutate(AVG = round(adj_AVG, 3)) %>% 
      mutate(HR = round(adj_HR)) %>%
      mutate(BB = round(adj_BB)) %>%
      mutate(ebWAR = round(adj_bWAR, 2)) %>%
      mutate(efWAR = round(adj_fWAR, 2))
  })) 

  
  bat_season <- batters_adjusted %>% 
    mutate(PA = adj_AB + BB + HBP + SF)
  bat_season[bat_season$HR >= bat_season$adj_AB, ]$HR <- 
    bat_season[bat_season$HR >= bat_season$adj_AB, ]$adj_AB

 bat_season <- bat_season %>% 
   dplyr::select(-c("adj_H", "adj_HR", "adj_BB", "adj_AVG", "adj_OBP", "adj_bWAR", "adj_fWAR", 'mapped_PA'))
 bat_season <- bat_season %>% mutate(OBP = round((AVG * adj_AB + BB + HBP)/(adj_AB + BB + HBP + SF), 3 )) %>%
   mutate(AVG = ifelse(AVG > 0, AVG, 0)) %>%
   mutate(HR = ifelse(HR > 0, HR, 0)) %>%
   mutate(BB = ifelse(BB > 0, BB, 0)) %>%
   mutate(OBP = ifelse(OBP > 0, OBP, 0))
 
 # rename by name rather than by column position
 bat_season <- bat_season %>% dplyr::rename(AB = adj_AB, BA = AVG)
 bat_season_para <- bat_season %>% mutate(H = ceiling(AB * BA)) %>%
   mutate(BA = round(H / AB, 3)) %>%
   mutate(BA = ifelse(AB == 0, 0, BA)) %>%
   mutate(OBP = round((H+BB+HBP)/(AB+BB+HBP+SF), 3)) %>%
   mutate(OBP = ifelse(AB+BB+HBP+SF == 0, 0, OBP))
 
  bat_career_para <- bat_season_para %>% group_by(playerID) %>% 
    summarise(name = unique(name), 
              playerID = unique(playerID), 
              PA = sum(round(PA)), 
              AB = sum(AB), 
              H = sum(H), 
              HR = sum(round(HR)), 
              BB = sum(round(BB)), 
              BA = round(H/AB, 3), 
              HBP = sum(HBP), 
              SF = sum(SF), 
              OBP = round((H + BB + HBP)/(AB + BB + HBP + SF), 3), 
              ebWAR = sum(ebWAR), 
              efWAR = sum(efWAR)) %>% ungroup() %>% 
    arrange(desc(ebWAR))
  
  bat_career_para <- bat_career_para %>% mutate(BA = ifelse(AB == 0, 0, BA))
  bat_career_para <- bat_career_para %>% mutate(OBP = ifelse(AB + BB + HBP + SF == 0, 0, OBP))

Results:

batters using non-parametric distribution

Career results for the top 15 batting leaders by ebWAR, with BA measured using the non-parametric distribution.
playerID name PA AB H HR BB BA HBP SF OBP ebWAR efWAR
bondsba01 Barry Bonds 12740 10170 3049 654 2373 0.300 106 91 0.434 153.89 145.24
mayswi01 Willie Mays 12814 11151 3475 577 1528 0.312 44 91 0.394 144.08 135.39
aaronha01 Henry Aaron 14113 12540 3887 689 1420 0.310 32 121 0.378 135.60 128.05
ruthba01 Babe Ruth 10829 8884 2672 702 1902 0.301 43 0 0.426 127.29 120.44
rodrial01 Alex Rodriguez 11917 10293 3047 547 1338 0.296 176 110 0.383 120.29 110.30
musiast01 Stan Musial 13036 11420 3579 492 1510 0.313 53 53 0.394 119.51 113.03
cobbty01 Ty Cobb 12721 11659 3726 247 971 0.320 91 0 0.376 114.48 108.77
pujolal01 Albert Pujols 13201 11432 3522 662 1523 0.308 123 123 0.391 111.86 97.34
schmimi01 Mike Schmidt 10310 8517 2331 561 1606 0.274 79 108 0.390 109.58 106.41
henderi01 Rickey Henderson 13760 11331 3195 286 2264 0.282 98 67 0.404 109.08 103.90
willite01 Ted Williams 10184 8258 2593 503 1867 0.314 39 20 0.442 107.86 107.75
speaktr01 Tris Speaker 12003 10723 3172 195 1179 0.296 101 0 0.371 102.26 95.13
morgajo02 Joe Morgan 11522 9444 2648 308 1942 0.280 40 96 0.402 100.17 96.07
robinfr02 Frank Robinson 11945 10213 3001 535 1432 0.294 198 102 0.388 99.93 95.92
ottme01 Mel Ott 11058 9370 2657 418 1624 0.284 64 0 0.393 99.74 95.72

batters using parametric distribution

Career results for the top 15 batting leaders by ebWAR, with BA measured using the parametric distribution.
playerID name PA AB H HR BB BA HBP SF OBP ebWAR efWAR
bondsba01 Barry Bonds 12740 10170 3044 654 2373 0.299 106 91 0.434 153.89 145.24
mayswi01 Willie Mays 12814 11151 3457 577 1528 0.310 44 91 0.392 144.08 135.39
aaronha01 Henry Aaron 14113 12540 3923 689 1420 0.313 32 121 0.381 135.60 128.05
ruthba01 Babe Ruth 10829 8884 2692 702 1902 0.303 43 0 0.428 127.29 120.44
rodrial01 Alex Rodriguez 11917 10293 3039 547 1338 0.295 176 110 0.382 120.29 110.30
musiast01 Stan Musial 13036 11420 3589 492 1510 0.314 53 53 0.395 119.51 113.03
cobbty01 Ty Cobb 12721 11659 3876 247 971 0.332 91 0 0.388 114.48 108.77
pujolal01 Albert Pujols 13201 11432 3510 662 1523 0.307 123 123 0.391 111.86 97.34
schmimi01 Mike Schmidt 10310 8517 2339 561 1606 0.275 79 108 0.390 109.58 106.41
henderi01 Rickey Henderson 13760 11331 3180 286 2264 0.281 98 67 0.403 109.08 103.90
willite01 Ted Williams 10184 8258 2596 503 1867 0.314 39 20 0.442 107.86 107.75
speaktr01 Tris Speaker 12003 10723 3223 195 1179 0.301 101 0 0.375 102.26 95.13
morgajo02 Joe Morgan 11522 9444 2652 308 1942 0.281 40 96 0.402 100.17 96.07
robinfr02 Frank Robinson 11945 10213 3010 535 1432 0.295 198 102 0.388 99.93 95.92
ottme01 Mel Ott 11058 9370 2662 418 1624 0.284 64 0 0.393 99.74 95.72

We apply our Full House Model to bWAR, fWAR, SO, and ERA for pitchers, using the non-parametric distribution to measure the components.

pitchers

Career results for the top 15 pitching leaders by ebWAR.
playerID name IP ER ERA K ebWAR efWAR
clemero02 Roger Clemens 5456 1701 2.81 4752 145.88 141.25
maddugr01 Greg Maddux 5646 1739 2.77 3473 113.66 120.73
johnsra05 Randy Johnson 4724 1521 2.90 5136 110.81 109.77
seaveto01 Tom Seaver 4587 1482 2.91 3656 104.31 90.78
grovele01 Lefty Grove 3518 1085 2.78 2826 102.54 98.80
verlaju01 Justin Verlander 4107 1283 2.81 3297 100.23 95.07
blylebe01 Bert Blyleven 4877 1721 3.18 3785 97.69 101.82
niekrph01 Phil Niekro 5082 1776 3.15 3364 94.37 77.47
kershcl01 Clayton Kershaw 3484 941 2.43 2980 93.78 88.83
johnswa01 Walter Johnson 4791 1766 3.32 3888 91.53 91.80
spahnwa01 Warren Spahn 5112 1832 3.23 2955 91.20 72.30
scherma01 Max Scherzer 3658 1150 2.83 3506 90.63 82.14
greinza01 Zack Greinke 4278 1433 3.01 2916 90.23 80.28
perryga01 Gaylord Perry 4977 1768 3.20 3366 89.50 94.45
carltst01 Steve Carlton 4816 1641 3.07 4221 88.70 100.34

Top 25 greatest baseball players from different sources.

rank ebWAR efWAR bWAR fWAR ESPN Hall of Stats
1 Barry Bonds Barry Bonds Babe Ruth Babe Ruth Babe Ruth Babe Ruth
2 Roger Clemens Roger Clemens Walter Johnson Barry Bonds Willie Mays Barry Bonds
3 Willie Mays Willie Mays Cy Young Willie Mays Hank Aaron Walter Johnson
4 Babe Ruth Henry Aaron Barry Bonds Ty Cobb Ty Cobb Willie Mays
5 Henry Aaron Greg Maddux Willie Mays Honus Wagner Ted Williams Cy Young
6 Alex Rodriguez Babe Ruth Ty Cobb Hank Aaron Lou Gehrig Ty Cobb
7 Stan Musial Stan Musial Hank Aaron Roger Clemens Mickey Mantle Hank Aaron
8 Ty Cobb Alex Rodriguez Roger Clemens Cy Young Barry Bonds Roger Clemens
9 Greg Maddux Randy Johnson Tris Speaker Tris Speaker Walter Johnson Rogers Hornsby
10 Albert Pujols Ty Cobb Honus Wagner Ted Williams Stan Musial Honus Wagner
11 Randy Johnson Nolan Ryan Stan Musial Rogers Hornsby Pedro Martinez Tris Speaker
12 Mike Schmidt Ted Williams Rogers Hornsby Stan Musial Honus Wagner Ted Williams
13 Rickey Henderson Mike Schmidt Eddie Collins Eddie Collins Ken Griffey Jr.  Stan Musial
14 Ted Williams Rickey Henderson Ted Williams Walter Johnson Greg Maddux Eddie Collins
15 Tom Seaver Bert Blyleven Pete Alexander Greg Maddux Mike Trout Pete Alexander
16 Lefty Grove Steve Carlton Alex Rodriguez Lou Gehrig Joe DiMaggio Alex Rodriguez
17 Tris Speaker Lefty Grove Kid Nichols Alex Rodriguez Roger Clemens Lou Gehrig
18 Justin Verlander Albert Pujols Lou Gehrig Mickey Mantle Mike Schmidt Mickey Mantle
19 Joe Morgan Joe Morgan Rickey Henderson Mel Ott Frank Robinson Lefty Grove
20 Frank Robinson Frank Robinson Mel Ott Randy Johnson Rogers Hornsby Mel Ott
21 Mel Ott Mel Ott Mickey Mantle Nolan Ryan Cy Young Rickey Henderson
22 Bert Blyleven Tris Speaker Tom Seaver Mike Schmidt Tom Seaver Kid Nichols
23 Cal Ripken Jr Justin Verlander Frank Robinson Rickey Henderson Rickey Henderson Mike Schmidt
24 Rogers Hornsby Gaylord Perry Nap Lajoie Frank Robinson Randy Johnson Nap Lajoie
25 Lou Gehrig Rogers Hornsby Mike Schmidt Bert Blyleven Christy Mathewson Christy Mathewson

BA rankings from different sources.

rank Peak in Full House Career in Full House Schell Method 1 Schell Method 2 Era-bridging Method Raw Career
1 Rod Carew Tony Gwynn Tony Gwynn Tony Gwynn Ty Cobb Ty Cobb
2 Ichiro Suzuki Rod Carew Ty Cobb Ty Cobb Tony Gwynn Rogers Hornsby
3 Jose Altuve Jose Altuve Rod Carew Rod Carew Ted Williams Shoeless Joe Jackson
4 Albert Pujols Ichiro Suzuki Shoeless Joe Jackson Rogers Hornsby Wade Boggs Lefty O’Doul
5 Joe Mauer Miguel Cabrera Rogers Hornsby Stan Musial Rod Carew Ed Delahanty
6 Josh Hamilton Roberto Clemente Ted Williams Nap Lajoie Shoeless Joe Jackson Tris Speaker
7 Miguel Cabrera Ty Cobb Honus Wagner Shoeless Joe Jackson Nap Lajoie Billy Hamilton
8 Trea Turner Joe DiMaggio Stan Musial Honus Wagner Stan Musial Ted Williams
9 Harry Walker Wade Boggs Wade Boggs Ted Williams Frank Thomas Dan Brouthers
10 Jeff McNeil Buster Posey Nap Lajoie Wade Boggs Ed Delahanty Babe Ruth
11 Tony Gwynn Mike Trout Tris Speaker Pete Browning Tris Speaker Dave Orr
12 John Olerud Freddie Freeman Pete Browning Tris Speaker Rogers Hornsby Harry Heilmann
13 José Reyes Joe Mauer Willie Mays Mike Piazza Hank Aaron Pete Browning
14 Alex Rodriguez Ted Williams Dan Brouthers Dan Brouthers Alex Rodriguez Willie Keeler
15 Keith Hernandez Stan Musial Kirby Puckett Tip O’Neill Pete Rose Bill Terry
16 Mookie Betts Willie Mays Babe Ruth Kirby Puckett Honus Wagner Lou Gehrig
17 Pete Rose Bill Terry Tip O’Neill Tony Oliva Roberto Clemente George Sisler
18 Dee Strange Gordon Robinson Canó Willie Keeler Vladimir Guerrero George Brett Jesse Burkett
19 Edgar Martinez Henry Aaron Joe DiMaggio Mike Donlin Don Mattingly Tony Gwynn
20 Stan Musial Matty Alou Tony Oliva Willie Keeler Kirby Puckett Nap Lajoie
21 Ken Griffey Vladimir Guerrero Jesse Burkett Edgar Martinez Mike Piazza Jake Stenzel
22 Willie McGee Derek Jeter Eddie Collins Henry Aaron Eddie Collins Riggs Stephenson
23 Luis Arraez Al Oliver George Sisler Derek Aaron Edgar Martinez Al Simmons
24 Robin Yount Lou Gehrig Lou Gehrig Joe DiMaggio Paul Molitor Cap Anson
25 Derrek Lee Edgar Martinez Don Mattingly Babe Ruth Willie Mays John McGraw

HR rankings from different sources.

rank Peak in Full House Career in Full House Era-bridging Method PPS detrending method Peak in Schell Career in Schell Raw AB per HR
1 Babe Ruth Babe Ruth Mark McGwire Babe Ruth Barry Bonds Babe Ruth Mark McGwire
2 Willie Stargell Mark McGwire Juan Gonzalez Mel Ott Babe Ruth Mark McGwire Babe Ruth
3 Willie Mays Giancarlo Stanton Babe Ruth Lou Gehrig Mark McGwire Ted Williams Barry Bonds
4 Aaron Judge Dave Kingman Dave Kingman Jimmie Foxx Buck Freeman Barry Bonds Jim Thome
5 Giancarlo Stanton Ralph Kiner Mike Schmidt Hank Aaron Ed Delahanty Mike Schmidt Ralph Kiner
6 José Bautista Mike Schmidt Harmon Killebrew Rogers Hornsby Tim Jordan Lou Gehrig Harmon Killebrew
7 Mark McGwire Willie Stargell Frank Thomas Cy Williams Willie Stargell Harmon Killebrew Sammy Sosa
8 Chris Davis Barry Bonds Jose Canseco Barry Bonds Rogers Hornsby Jimmie Foxx Ted Williams
9 Luke Voit Jimmie Foxx Ron Kittle Willie Mays Jim Thome Dave Kingman Manny Ramirez
10 Ted Williams Mike Trout Willie Stargell Ted Williams Dave Kingman Reggie Jackson Adam Dunn
11 Eddie Mathews Ted Williams Willie McCovey Reggie Jackson Roy Sievers Bill Nicholson Ryan Howard
12 Khris Davis David Ortiz Darryl Strawberry Mike Schmidt Jeff Bagwell Mickey Mantle Juan Gonzalez
13 Bryce Harper Willie McCovey Bo Jackson Frank Robinson Ted Williams Ralph Kiner Dave Kingman
14 Mike Schmidt Harmon Killebrew Ted Williams Harmon Killebrew Kevin Mitchell Joe DiMaggio Russell Branyan
15 David Ortiz Mickey Mantle Ralph Kiner Gavvy Cravath Mike Schmidt Willie Stargell Mickey Mantle
16 Kevin Mitchell Hank Greenberg Pat Seerey Honus Wagner Lou Gehrig Hack Wilson Alex Rodriguez
17 Albert Pujols Darryl Strawberry Reggie Jackson Willie McCovey Fred Dunlap Rogers Hornsby Jimmie Foxx
18 Mickey Mantle Jose Canseco Ken Griffey Harry Stovey Harry Stovey Darryl Strawberry Mike Schmidt
19 Dave Kingman Lou Gehrig Albert Belle Ken Griffey Jr.  Charlie Hickman Willie McCovey Jose Canseco
20 Gorman Thomas Jim Thome Dick Allen Stan Musial Bill Nicholson Glenn Davis Albert Belle
21 George Foster Eddie Mathews Barry Bonds Willie Stargell Boog Powell Wally Berger Khris Davis
22 Johnny Bench Reggie Jackson Dean Palmer Eddie Murray Joe DiMaggio Eddie Mathews Ron Kittle
23 Darrell Evans Ryan Howard Hank Aaron Mark McGwire Eddie Mathews Harry Stovey Carlos Delgado
24 Andruw Jones Albert Pujols Jimmie Foxx Mickey Mantle Mickey Mantle Frank Howard Ken Griffey Jr. 
25 Reggie Jackson Hank Sauer Mike Piazza Al Simmons Tris Speaker Mel Ott Hank Greenberg

OBP rankings from different sources.

rank Career Full House Schell Method Raw Career
1 Ted Williams Ted Williams Ted Williams
2 Mike Trout Babe Ruth Babe Ruth
3 Barry Bonds Rogers Hornsby John McGraw
4 Joey Votto Barry Bonds Billy Hamilton
5 Babe Ruth John McGraw Lou Gehrig
6 Mickey Mantle Billy Hamilton Barry Bonds
7 Bryce Harper Topsy Hartsel Bill Joyce
8 Lou Gehrig Mel Ott Jud Wilson
9 Frank Thomas Roy Thomas Rogers Hornsby
10 Freddie Freeman Mickey Mantle Ty Cobb
11 Edgar Martinez Wade Boggs Jimmie Foxx
12 Lance Berkman Frank Thomas Tris Speaker
13 Paul Goldschmidt Lou Gehrig Eddie Collins
14 Wade Boggs Rickey Henderson Ferris Fain
15 Rickey Henderson Stan Musial Dan Brouthers
16 Jason Giambi Edgar Martinez Max Bishop
17 Joe Mauer Ty Cobb Shoeless Joe Jackson
18 Miguel Cabrera Dan Brouthers Mickey Mantle
19 Joe Morgan Tris Speaker Mickey Cochrane
20 Prince Fielder Joe Cunningham Frank Thomas
21 Brian Giles George Gore Edgar Martinez
22 Mike Hargrove Eddie Collins Turkey Stearnes
23 Manny Ramirez Ross Youngs Stan Musial
24 Jeff Bagwell Mike Hargrove Cupid Childs
25 Jim Thome Jeff Bagwell Wade Boggs

Table 1: The proportion of seasons in which BA failed the Shapiro-Wilk test at different p-value thresholds (Supplementary Materials).

p_value prop_season
0.05 0.15
0.10 0.24
0.20 0.34
0.30 0.39
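
A table of this form can be produced by running the Shapiro-Wilk test within each season and counting how often the p-value falls below each threshold. The sketch below is illustrative only: it uses simulated batting averages in a hypothetical data frame `ba_sim`, not the data behind the table, and the helper name `prop_failing` is introduced here for the example.

```r
# Illustrative only: simulated BA values, not the data used in this report.
set.seed(1)
ba_sim <- data.frame(
  yearID = rep(2001:2020, each = 150),
  BA     = rnorm(20 * 150, mean = 0.260, sd = 0.030)
)

# Shapiro-Wilk p-value for the BA distribution within each season
p_values <- sapply(split(ba_sim$BA, ba_sim$yearID),
                   function(x) shapiro.test(x)$p.value)

# Share of seasons whose p-value falls below each threshold
# (prop_failing is a hypothetical helper name)
prop_failing <- function(p, thresholds = c(0.05, 0.10, 0.20, 0.30)) {
  data.frame(p_value     = thresholds,
             prop_season = sapply(thresholds, function(a) mean(p < a)))
}

shapiro_summary <- prop_failing(p_values)
print(shapiro_summary)
```

Because the thresholds are nested, the resulting proportions are non-decreasing in the threshold, as in the table above.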

Histogram of the p-values from the Shapiro-Wilk test of normality applied to the BA distribution in each season (Supplementary Materials).

Table 1: Top 25 era-adjusted BA, OBP, HR, bWAR and fWAR leaders in the manuscript

rank name BA name OBP name HR
1 Tony Gwynn 0.342 Ted Williams 0.442 Babe Ruth 702
2 Rod Carew 0.329 Mike Trout 0.438 Henry Aaron 689
3 Jose Altuve 0.327 Barry Bonds 0.434 Albert Pujols 662
4 Ichiro Suzuki 0.327 Joey Votto 0.433 Barry Bonds 654
5 Miguel Cabrera 0.32 Babe Ruth 0.426 Reggie Jackson 578
6 Roberto Clemente 0.32 Mickey Mantle 0.42 Willie Mays 577
7 Ty Cobb 0.32 Bryce Harper 0.417 Mike Schmidt 561
8 Joe DiMaggio 0.318 Lou Gehrig 0.415 Alex Rodriguez 547
9 Wade Boggs 0.316 Frank Thomas 0.411 Frank Robinson 535
10 Buster Posey 0.316 Freddie Freeman 0.41 Ken Griffey Jr 528
11 Mike Trout 0.315 Edgar Martinez 0.41 Willie Stargell 528
12 Freddie Freeman 0.314 Lance Berkman 0.407 David Ortiz 521
13 Joe Mauer 0.314 Paul Goldschmidt 0.407 Willie McCovey 515
14 Ted Williams 0.314 Wade Boggs 0.405 Harmon Killebrew 508
15 Stan Musial 0.313 Christian Yelich 0.405 Ted Williams 503
16 Willie Mays 0.312 Rickey Henderson 0.404 Mickey Mantle 502
17 Bill Terry 0.312 Jason Giambi 0.403 Eddie Mathews 502
18 Robinson Canó 0.311 Joe Mauer 0.403 Eddie Murray 498
19 Henry Aaron 0.31 Miguel Cabrera 0.402 Jimmie Foxx 493
20 Matty Alou 0.31 Joe Morgan 0.402 Stan Musial 492
21 Vladimir Guerrero 0.31 Prince Fielder 0.4 Dave Winfield 491
22 Derek Jeter 0.31 Brian Giles 0.4 Mark McGwire 489
23 Al Oliver 0.31 Mike Hargrove 0.4 Jim Thome 484
24 Lou Gehrig 0.309 Manny Ramirez 0.4 Miguel Cabrera 480
25 Edgar Martinez 0.309 Jeff Bagwell 0.399 Lou Gehrig 479
pre-1950 in top 10 2 3 1
pre-1950 in top 25 6 3 5
proportion before 1950 0.259 0.259 0.259
talent pool 1871-2012 1871-2012 1871-2012
chance in top 10 1 in 1.29 1 in 1.99 1 in 1.05
chance in top 25 1 in 1.51 1 in 1.03 1 in 1.23
rank name ebWAR name efWAR
1 Barry Bonds 153.89 Barry Bonds 145.24
2 Willie Mays 144.08 Willie Mays 135.39
3 Henry Aaron 135.6 Henry Aaron 128.05
4 Babe Ruth 127.29 Babe Ruth 120.44
5 Alex Rodriguez 120.29 Stan Musial 113.03
6 Stan Musial 119.51 Alex Rodriguez 110.3
7 Ty Cobb 114.48 Ty Cobb 108.77
8 Albert Pujols 111.86 Ted Williams 107.75
9 Mike Schmidt 109.58 Mike Schmidt 106.41
10 Rickey Henderson 109.08 Rickey Henderson 103.9
11 Ted Williams 107.86 Albert Pujols 97.34
12 Tris Speaker 102.26 Joe Morgan 96.07
13 Joe Morgan 100.17 Frank Robinson 95.92
14 Frank Robinson 99.93 Mel Ott 95.72
15 Mel Ott 99.74 Tris Speaker 95.13
16 Cal Ripken Jr 97.39 Rogers Hornsby 94.42
17 Rogers Hornsby 97.01 Mickey Mantle 94.3
18 Lou Gehrig 95.87 Cal Ripken Jr 93.24
19 Mickey Mantle 95.37 Lou Gehrig 92.98
20 Carl Yastrzemski 95.2 Carl Yastrzemski 92.64
21 Adrián Beltré 95.01 Honus Wagner 89.81
22 Wade Boggs 92.51 Wade Boggs 87.91
23 Roberto Clemente 91.37 Mike Trout 87.87
24 Eddie Collins 90.94 Eddie Mathews 86.38
25 Mike Trout 90.43 Adrián Beltré 86.34
pre-1950 in top 10 3 4
pre-1950 in top 25 9 9
proportion before 1950 0.263 0.263
chance in top 10 1 in 1.95 1 in 3.92
chance in top 25 1 in 5.31 1 in 5.31

Table 2: Top 25 era-adjusted IP, ERA, SO, bWAR and fWAR leaders in the manuscript.

rank name IP name ERA name SO
1 Greg Maddux 5646 Clayton Kershaw 2.43 Nolan Ryan 6026
2 Roger Clemens 5456 Pedro Martinez 2.61 Randy Johnson 5136
3 Nolan Ryan 5319 Greg Maddux 2.77 Roger Clemens 4752
4 Warren Spahn 5112 Lefty Grove 2.78 Steve Carlton 4221
5 Phil Niekro 5082 Roger Clemens 2.81 Walter Johnson 3888
6 Don Sutton 5071 Justin Verlander 2.81 Bert Blyleven 3785
7 Gaylord Perry 4977 Max Scherzer 2.83 Tom Seaver 3656
8 Tom Glavine 4917 Roy Halladay 2.85 Don Sutton 3575
9 Bert Blyleven 4877 Randy Johnson 2.9 Max Scherzer 3506
10 Steve Carlton 4816 Tom Seaver 2.91 Greg Maddux 3473
11 Walter Johnson 4791 Cole Hamels 2.94 Gaylord Perry 3366
12 Randy Johnson 4724 Carl Hubbell 2.94 Phil Niekro 3364
13 Cy Young 4663 Whitey Ford 2.96 Justin Verlander 3297
14 Tom Seaver 4587 Curt Schilling 2.96 Pedro Martinez 3113
15 Tommy John 4585 John Smoltz 2.96 John Smoltz 3106
16 Jamie Moyer 4585 Bob Gibson 2.97 Bob Feller 3104
17 Robin Roberts 4435 Jim Palmer 2.97 Fergie Jenkins 3088
18 Pete Alexander 4356 Zack Greinke 3.01 Curt Schilling 3036
19 Zack Greinke 4278 Tim Hudson 3.03 Clayton Kershaw 2980
20 Early Wynn 4277 Juan Marichal 3.05 CC Sabathia 2960
21 Fergie Jenkins 4199 Steve Carlton 3.07 Warren Spahn 2955
22 Dennis Martinez 4182 Tom Glavine 3.1 Zack Greinke 2916
23 CC Sabathia 4169 Félix Hernández 3.1 Frank Tanana 2849
24 Justin Verlander 4107 Kevin Brown 3.12 Bob Gibson 2836
25 Jack Morris 4039 Adam Wainwright 3.12 Lefty Grove 2826
pre-1950 in top 10 1 1 1
pre-1950 in top 25 6 2 4
proportion before 1950 0.298 0.28 0.28
talent pool 1871-2006 1871-2012 1871-2006
chance in top 10 1 in 1.03 1 in 1.04 1 in 1.04
chance in top 25 1 in 1.25 1 in 1 1 in 1.05
rank name ebWAR name efWAR
1 Roger Clemens 145.88 Roger Clemens 141.25
2 Greg Maddux 113.66 Greg Maddux 120.73
3 Randy Johnson 110.81 Randy Johnson 109.77
4 Tom Seaver 104.31 Nolan Ryan 108.3
5 Lefty Grove 102.54 Bert Blyleven 101.82
6 Justin Verlander 100.23 Steve Carlton 100.34
7 Bert Blyleven 97.69 Lefty Grove 98.8
8 Phil Niekro 94.37 Justin Verlander 95.07
9 Clayton Kershaw 93.78 Gaylord Perry 94.45
10 Walter Johnson 91.53 Walter Johnson 91.8
11 Warren Spahn 91.2 Cy Young 91.28
12 Max Scherzer 90.63 Tom Seaver 90.78
13 Zack Greinke 90.23 Clayton Kershaw 88.83
14 Gaylord Perry 89.5 Don Sutton 82.98
15 Steve Carlton 88.7 Max Scherzer 82.14
16 Pedro Martinez 87.2 Pedro Martinez 82.13
17 Nolan Ryan 86.85 Zack Greinke 80.28
18 Mike Mussina 84.62 Mike Mussina 80.08
19 Curt Schilling 82.09 John Smoltz 79.36
20 Tom Glavine 81.89 Pete Alexander 77.55
21 Robin Roberts 79.01 Phil Niekro 77.47
22 Fergie Jenkins 77.83 Curt Schilling 77
23 Bob Gibson 77 Bob Gibson 76.45
24 Roy Halladay 76.07 Fergie Jenkins 75.97
25 CC Sabathia 74.5 Tommy John 75.8
pre-1950 in top 10 2 2
pre-1950 in top 25 4 4
proportion before 1950 0.28 0.28
chance in top 10 1 in 1.22 1 in 1.22
chance in top 25 1 in 1.05 1 in 1.05

Table 3: Top 25 careers according to era-adjusted bWAR (ebWAR), era-adjusted fWAR (efWAR), era-adjusted JAWS computed using bWAR (ebJAWS), and era-adjusted JAWS computed using fWAR (efJAWS) in the manuscript

name ebWAR name ebJAWS name efWAR name efJAWS
Barry Bonds 153.89 Barry Bonds 109.14 Barry Bonds 145.24 Roger Clemens 103.54
Roger Clemens 145.88 Roger Clemens 107.47 Roger Clemens 141.25 Barry Bonds 103.21
Willie Mays 144.08 Willie Mays 105.16 Willie Mays 135.39 Willie Mays 98.72
Babe Ruth 137.98 Babe Ruth 100.70 Henry Aaron 128.05 Babe Ruth 90.37
Henry Aaron 135.60 Henry Aaron 95.11 Greg Maddux 120.73 Henry Aaron 89.52
Alex Rodriguez 120.29 Alex Rodriguez 91.66 Babe Ruth 120.28 Greg Maddux 88.51
Stan Musial 119.51 Stan Musial 88.38 Stan Musial 113.03 Randy Johnson 85.78
Ty Cobb 114.48 Randy Johnson 88.20 Alex Rodriguez 110.30 Alex Rodriguez 84.35
Greg Maddux 113.66 Albert Pujols 86.33 Randy Johnson 109.77 Stan Musial 83.61
Albert Pujols 111.86 Greg Maddux 85.60 Ty Cobb 108.77 Ted Williams 82.72
Randy Johnson 110.81 Mike Schmidt 84.52 Nolan Ryan 108.30 Mike Schmidt 82.20
Mike Schmidt 109.58 Lefty Grove 84.52 Ted Williams 107.75 Lefty Grove 79.31
Rickey Henderson 109.08 Ted Williams 83.54 Mike Schmidt 106.41 Ty Cobb 78.85
Ted Williams 107.86 Ty Cobb 82.62 Rickey Henderson 103.90 Rickey Henderson 78.83
Tom Seaver 104.31 Rickey Henderson 82.14 Bert Blyleven 101.82 Steve Carlton 78.32
Lefty Grove 102.54 Justin Verlander 80.34 Steve Carlton 100.34 Nolan Ryan 76.89
Tris Speaker 102.26 Joe Morgan 79.03 Lefty Grove 98.80 Albert Pujols 75.80
Justin Verlander 100.23 Tom Seaver 78.51 Albert Pujols 97.34 Justin Verlander 75.29
Joe Morgan 100.17 Cal Ripken Jr 77.54 Joe Morgan 96.07 Joe Morgan 75.14
Frank Robinson 99.93 Mike Trout 77.26 Frank Robinson 95.92 Bert Blyleven 75.12
Mel Ott 99.74 Rogers Hornsby 76.61 Mel Ott 95.72 Rogers Hornsby 74.27
Bert Blyleven 97.69 Lou Gehrig 75.70 Tris Speaker 95.13 Mike Trout 73.90
Cal Ripken Jr 97.39 Wade Boggs 75.61 Justin Verlander 95.07 Mickey Mantle 73.71
Rogers Hornsby 97.01 Clayton Kershaw 75.12 Gaylord Perry 94.45 Cal Ripken Jr 73.62
Lou Gehrig 95.87 Mickey Mantle 75.09 Rogers Hornsby 94.42 Lou Gehrig 73.41
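
JAWS is conventionally defined as the average of a player's career WAR and the sum of his seven best single-season WAR values. Assuming the era-adjusted versions in this table follow that convention (the code for this step is not shown in this section), a minimal sketch on made-up season values looks like this. The data frame `seasons` and the function `jaws` are hypothetical names used only for illustration.

```r
# Hypothetical season-level era-adjusted WAR values (made up for illustration)
seasons <- data.frame(
  playerID = rep(c("playerA", "playerB"), times = c(10, 5)),
  ebWAR    = c(8, 7, 9, 6, 5, 10, 4, 3, 2, 1,   # playerA: 10 seasons
               6, 5, 7, 4, 3)                   # playerB: 5 seasons
)

# JAWS = average of career WAR and the sum of the n_peak best seasons
jaws <- function(war, n_peak = 7) {
  career <- sum(war)
  peak   <- sum(head(sort(war, decreasing = TRUE), n_peak))
  (career + peak) / 2
}

ebJAWS <- sapply(split(seasons$ebWAR, seasons$playerID), jaws)
print(ebJAWS)
```

For a player with fewer than seven seasons, the peak equals the career total under this definition, so JAWS and career WAR coincide.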

Table 4: Top 25 four-year peaks by batting average and at bats per home run (minimum 2000 era-adjusted plate appearances) in the manuscript

name year BA name year ABpHR
Jose Altuve 2014-2017 0.367 Barry Bonds 2001-2004 10.86
Tony Gwynn 1994-1997 0.366 Mark McGwire 1995-1998 11.15
Rod Carew 1974-1977 0.363 Babe Ruth 1918-1921 11.35
Miguel Cabrera 2010-2013 0.355 Giancarlo Stanton 2014-2017 11.85
Wade Boggs 1985-1988 0.353 Albert Pujols 2008-2011 12.20
Ichiro Suzuki 2001-2004 0.353 Eddie Mathews 1953-1956 12.34
Barry Bonds 2001-2004 0.352 Willie Stargell 1970-1973 12.45
Joe Mauer 2006-2009 0.350 Jose Canseco 1988-1991 12.46
Roberto Clemente 1964-1967 0.345 Mike Schmidt 1980-1983 12.57
Joe DiMaggio 1938-1941 0.345 José Bautista 2010-2013 12.68
Albert Pujols 2003-2006 0.343 Gorman Thomas 1978-1981 12.80
Don Mattingly 1984-1987 0.341 Ralph Kiner 1949-1952 12.87
Mike Piazza 1995-1998 0.341 Khris Davis 2015-2018 12.89
Willie Mays 1957-1960 0.340 Aaron Judge 2020-2023 13.25
Matty Alou 1966-1969 0.339 Ted Williams 1944-1947 13.26
Tim Anderson 2019-2022 0.338 Sammy Sosa 1998-2001 13.55
Stan Musial 1943-1946 0.338 Frank Howard 1967-1970 13.56
Rogers Hornsby 1922-1925 0.335 Mickey Mantle 1960-1963 13.66
Ted Williams 1943-1946 0.335 David Ortiz 2012-2015 13.71
Ty Cobb 1912-1915 0.334 Nelson Cruz 2017-2020 13.88
Trea Turner 2019-2022 0.334 Jimmie Foxx 1937-1940 13.91
Henry Aaron 1956-1959 0.333 Carlos Pena 2007-2010 13.91
Cecil Cooper 1980-1983 0.333 Dave Kingman 1976-1979 13.97
Freddie Freeman 2020-2023 0.333 Darryl Strawberry 1985-1988 13.98
Nap Lajoie 1901-1904 0.333 Jim Thome 2001-2004 14.01
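
A four-year peak can be found by sliding a window of four consecutive seasons over a career and keeping the best window. The sketch below does this for batting average on made-up numbers. The function name `best_window_BA` and the toy data are assumptions for illustration, and the minimum-PA filter is applied to the window total, which may differ from how the table was actually produced.

```r
# Toy career: one row per season (values are made up)
career <- data.frame(
  year = 2001:2006,
  AB   = c(550, 600, 580, 610, 590, 500),
  H    = c(160, 200, 190, 210, 170, 140),
  PA   = c(620, 670, 650, 680, 660, 560)
)

# Best BA over any `width` consecutive rows with at least `min_PA` total PA.
# Assumes one row per consecutive season, sorted by year.
best_window_BA <- function(df, width = 4, min_PA = 2000) {
  if (nrow(df) < width) return(NULL)
  starts <- seq_len(nrow(df) - width + 1)
  out <- do.call(rbind, lapply(starts, function(s) {
    w <- df[s:(s + width - 1), ]
    data.frame(years = paste(min(w$year), max(w$year), sep = "-"),
               PA    = sum(w$PA),
               BA    = round(sum(w$H) / sum(w$AB), 3))
  }))
  out <- out[out$PA >= min_PA, ]
  out[which.max(out$BA), ]
}

peak <- best_window_BA(career)
print(peak)
```

The same windowing applies to at bats per home run by replacing the BA ratio with `sum(AB) / sum(HR)` and selecting the minimum instead of the maximum.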

Table 5: Comparison of the top three era-adjusted seasons by Babe Ruth and Barry Bonds according to home run rate (AB per HR) in the manuscript

name year ABpHR diff ystar balance u n p_beta pop HR_talent
Babe Ruth 1919 10.95 0.0662271 0.0010814 0.9683747 0.9997320 118 0.9688654 2020530 5355122
Babe Ruth 1920 10.88 0.0508460 0.0003966 0.9846381 0.9998818 130 0.9847546 2517182 12061976
Babe Ruth 1926 10.83 0.0412598 0.0007058 0.9669191 0.9997455 130 0.9674563 3499956 8272038
Barry Bonds 2001 10.86 0.0661758 0.0014000 0.9594065 0.9998329 243 0.9602162 11200119 18901295
Barry Bonds 2002 10.83 0.0259097 0.0013214 0.9074378 0.9996206 244 0.9115765 11725596 9660871
Barry Bonds 2004 10.76 0.0373425 0.0016128 0.9204885 0.9996714 242 0.9235554 12829045 11901386

Histogram of the best 100 and 1000 players by their rookie year

Discussion

Year Effect on the Talent of the Replacement-level Batters

The year effect for bWAR from the talent perspective and the era-adjusted bWAR perspective will be covered in this section.

The two figures below show the bWAR and fWAR talent of replacement-level batters from 1871 to 2023. The line in each plot is a smoothed fit obtained with a natural cubic spline. The talent of replacement-level batters follows a pattern similar to that of the talent pool. The seasons that deviate from the smoothed line are those associated with strikes, World War II, and the shortened 2020 season: 1943-1946, 1981, 1994, 1995, and 2020.
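
A smoothed line of this kind can be obtained by regressing the yearly series on a natural cubic spline basis in year. The sketch below uses a simulated series in a hypothetical data frame `talent_sim`, not the replacement-level talent values themselves. The choice `df = 6` mirrors the commented-out spline fits in the code earlier in this report, but whether the figures use the same value is an assumption.

```r
library(splines)

# Simulated yearly series standing in for replacement-level talent (not real data)
set.seed(2023)
talent_sim <- data.frame(year = 1871:2023)
talent_sim$talent <- 0.01 * (talent_sim$year - 1871) +
  rnorm(nrow(talent_sim), sd = 0.2)

# Natural cubic spline in year (df = 6 is an assumed value)
fit <- lm(talent ~ ns(year, df = 6), data = talent_sim)
talent_sim$smoothed <- predict(fit, newdata = talent_sim)

# Seasons that sit furthest from the smoothed line
talent_sim$resid <- talent_sim$talent - talent_sim$smoothed
head(talent_sim[order(-abs(talent_sim$resid)),
                c("year", "talent", "smoothed")], 5)
```

Sorting by absolute residual is one simple way to flag seasons that deviate from the trend, such as the shortened seasons mentioned above.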

Year Effect on the Era-adjusted WAR of the Hypothetical Batters from 2023

In this section we calculate the bWAR talent of a hypothetical 2023 hitter with 0 bWAR. Then, we compute his era-adjusted bWAR by mapping this hypothetical player to the other seasons from 1871 through 2023.

The figure above shows the era-adjusted bWAR values over time for a hypothetical batter with 0 bWAR in 2023 under the Full House Model. The fall in the mid-2000s corresponds to an increase in the talent pool. Despite the sharp decline in era-adjusted bWAR values in the 1981 season, it does not follow that 1981 batters with 0 bWAR would perform better than 2023 batters. The decline arises because the 1981 season was short and a large number of batters underperformed replacement-level players. The problem is resolved when we look at the era-adjusted bWAR and era-adjusted bWAR per game of hypothetical 2023 hitters with 2 bWAR.

The two figures below show the era-adjusted bWAR and era-adjusted bWAR per game of a hypothetical 2023 hitter with 2 bWAR.

We also calculate the fWAR talent of a hypothetical 2023 hitter with 0 fWAR, and then compute his era-adjusted fWAR by mapping this hypothetical player to the other seasons from 1871 through 2023.

The figure above shows the era-adjusted fWAR values over time for a hypothetical batter with 0 fWAR in 2023 under the Full House Model. The fall in the mid-2000s corresponds to an increase in the talent pool. The sharp decline in era-adjusted fWAR values in the 1981 season is likewise resolved when we look at the era-adjusted fWAR and era-adjusted fWAR per game of a hypothetical 2023 hitter with 2 fWAR.

The two figures below show the era-adjusted fWAR and era-adjusted fWAR per game of a hypothetical 2023 hitter with 2 fWAR.

Expansion Effect

It might be supposed that Major League expansion has a significant impact on how talent scores vary over time, and that talent is diluted by expansion. However, after adding hypothetical players with poor performance to the early seasons, when there were fewer players, we find that the real players did not benefit much from this adjustment. We also attempted to establish replacement-player baselines for all seasons and to set these baselines at the same level. However, the baselines we construct for each season are quite similar, and the talents of the players from earlier eras do not improve much.

We also perform several simulations to examine the effect of season size. Using the same talent-generating process, we randomly generate different numbers of components from the same distribution and compare the top 1, top 50, top 100, and top 300 talent scores. In each simulation, we generate two groups of components from the standard normal distribution, one consisting of 600 components and the other of 300 components. We then record the number of largest talents for which group 1 exceeds group 2. We run this simulation 1000 times and display the distribution of the results. The graphs below show that the top talent scores for different numbers of components from the same distribution are identical.

We also run another simulation to test the effect of season size. Instead of generating 600 components in group 1, we generate 900 components and repeat the rest of the simulation. The results are similar.

In this part, we compare BA in each expansion season with BA in the season immediately before it to see whether there is a significant difference. The expansion seasons are collected from Wikipedia: 1879, 1892, 1900, 1901, 1961, 1962, 1969, 1977, 1993, and 1998. We compute the mean and standard deviation of full-time batters’ BA before and in the expansion seasons, and the results are shown below.

yearID mean sd yearID mean sd
1878 0.281 0.043 1879 0.270 0.041
1891 0.264 0.027 1892 0.259 0.029
1899 0.293 0.037 1900 0.291 0.034
1900 0.291 0.034 1901 0.284 0.039
1960 0.268 0.025 1961 0.272 0.029
1961 0.272 0.029 1962 0.271 0.026
1968 0.250 0.027 1969 0.261 0.029
1976 0.264 0.030 1977 0.273 0.028
1992 0.265 0.026 1993 0.276 0.028
1997 0.276 0.027 1998 0.276 0.027

The average and standard error of the difference between the full-time batters’ BA before and in the expansion seasons are shown below.

avg se
-9e-04 0.0024288
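As a check on the summary above, the average and standard error can be recomputed from the rounded seasonal means in the preceding table. This is a sketch under the assumption that the difference is taken as the prior-season mean minus the expansion-season mean and that the standard error is the sample standard deviation of the ten differences divided by the square root of ten; the vector names are illustrative.

```r
# Mean BA of full-time batters in the season before each expansion (from the table)
before <- c(0.281, 0.264, 0.293, 0.291, 0.268, 0.272, 0.250, 0.264, 0.265, 0.276)
# Mean BA of full-time batters in each expansion season (from the table)
during <- c(0.270, 0.259, 0.291, 0.284, 0.272, 0.271, 0.261, 0.273, 0.276, 0.276)

d   <- before - during            # one difference per expansion season
avg <- mean(d)                    # average difference
se  <- sd(d) / sqrt(length(d))    # standard error of the average difference

round(c(avg = avg, se = se), 7)
```

With these inputs the result agrees with the reported values (avg of about -0.0009 and se of about 0.0024288), so the average difference is well within one standard error of zero.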

We also show Willie Mays’s seasonal bWAR per game over his 22-year MLB career from the 1951 to 1973 seasons, excluding the 1953 season. The red dots mark the seasons in which MLB expanded. The results show that expansion did not have any significant positive or negative effect on Willie Mays’s prime seasons or on the tail of his career.

We also show Randy Johnson’s seasonal bWAR per game over his 22-year MLB career from the 1988 to 2009 seasons. The red dots mark the seasons in which MLB expanded. The results show that expansion did not have any significant positive or negative effect on Randy Johnson’s prime seasons.

4 moments of BA

The figure below shows four statistical moments of the batting average distribution for the 1871 to 2021 seasons. Points in red correspond to seasons surrounding the peak of WWII (1941-1946).
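The report does not spell out which definitions of the four moments are plotted. A minimal sketch, assuming the mean, the variance, and the standardized third and fourth central moments (skewness and kurtosis, population form), is given here. The function `four_moments` and the data frame `toy` with its made-up BA values are hypothetical.

```r
# Mean, variance, skewness, and kurtosis of a numeric vector (population form)
four_moments <- function(x) {
  x  <- x[!is.na(x)]
  m  <- mean(x)
  m2 <- mean((x - m)^2)
  m3 <- mean((x - m)^3)
  m4 <- mean((x - m)^4)
  c(mean = m, variance = m2, skewness = m3 / m2^1.5, kurtosis = m4 / m2^2)
}

# Made-up BA values for two seasons, for illustration only
toy <- data.frame(
  yearID = rep(c(1911, 1997), each = 4),
  BA     = c(0.250, 0.270, 0.300, 0.340, 0.240, 0.265, 0.280, 0.330)
)

# One row per season, one column per moment
moments_by_year <- t(sapply(split(toy$BA, toy$yearID), four_moments))
```

Applied to the real batting data, each column of `moments_by_year` would give one panel of the figure when plotted against season.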

Distribution Sensitivity Analysis

Figure 2: Distribution sensitivity analysis in the Supplementary materials.

Table 3: Comparison between deteriorated estimation regime and Z-scores in the Supplementary materials.

correct Pareto dist incorrect Pareto dist normal dist folded normal dist
beat or ties 1 1 1.000 1
strictly beat 1 1 0.995 1

Z-scores for 1997 Gwynn and 1911 Cobb from park factor adjusted BAs and regular unadjusted BAs.

name year zscore_BA normality_BA zscore_obs_BA normality_obs_BA
Ty Cobb 1911 3.425657 0.1625329 3.460512 0.0436831
Tony Gwynn 1997 3.822984 0.0167243 3.339753 0.1744079
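For readers who want to see how quantities of this kind are obtained, here is a minimal sketch. It assumes that a Z-score standardizes a player's BA by the mean and standard deviation of that season's BA values, and that the normality columns are p-values from a Shapiro-Wilk test on the season's BA distribution; neither assumption is confirmed by the table itself. The function `season_z` and the simulated `toy_season` are hypothetical.

```r
# Z-score of one player's BA relative to the BA values of a season
season_z <- function(season_ba, player_ba) {
  (player_ba - mean(season_ba)) / sd(season_ba)
}

# Simulated season of 150 full-time batters (NOT real data)
set.seed(42)
toy_season <- rnorm(150, mean = 0.270, sd = 0.030)

z        <- season_z(toy_season, player_ba = 0.372)
p_normal <- shapiro.test(toy_season)$p.value  # small values suggest non-normality
```

A Z-score is easiest to interpret when the season's BA distribution is close to normal, which is why the normality check is reported next to each Z-score.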

Talent Pool Calculation

Based on the talent pool computed in the supplementary materials, we can obtain the proportion of the talent pool before 1950 for different spans of seasons. For example, suppose we would like to calculate the proportion of the talent pool before 1950 for the span from 1871 to 2006. Assuming the talent pool is evenly distributed across ages 20 to 29, the cumulative number of MLB-eligible males aged 20 to 29 from 1871 to 1879 is equal to 90% of the talent pool in 1871. Similarly, the cumulative number of MLB-eligible males aged 20 to 29 from 2000 to 2006 is equal to 60% of the talent pool in 2006. The code below shows how to compute the proportion of the talent pool before 1950 for the span from 1871 to 2006.

# One talent-pool value per season
MLBpops <- bat_dat %>%
  group_by(yearID) %>%
  summarise(population = round(mean(unique(pops)), 2))
# Keep 1871, each decade start from 1880 to 2000, and 2006
n <- MLBpops[MLBpops$yearID %in% c(1871, seq(1880, 2000, 10), 2006), ]
p <- n
# Row 1 (1871) covers 1871-1879: 90% of its talent pool
p$population[1] <- n$population[1] / 10 * 9
# Row 15 (2006) covers 2000-2006: 60% of its talent pool
p$population[15] <- n$population[15] / 10 * 6
# Cumulative proportion of the talent pool (cpp) up to each row
o <- p %>%
  mutate(cpp = round(cumsum(population) / sum(population), 3),
         population = round(population, 2))

# Report the cumulative proportion reached at the 1950 row
kable(o[o$yearID == 1950, c(1, 3)]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
yearID cpp
1950 0.292

Top 25 bWAR and fWAR leaders for MLB players with era-adjusted bWAR and fWAR

name ebWAR name efWAR
Barry Bonds 153.89 Barry Bonds 145.24
Roger Clemens 145.88 Roger Clemens 141.25
Willie Mays 144.08 Willie Mays 135.39
Babe Ruth 137.98 Henry Aaron 128.05
Henry Aaron 135.60 Greg Maddux 120.73
Alex Rodriguez 120.29 Babe Ruth 120.28
Stan Musial 119.51 Stan Musial 113.03
Ty Cobb 114.48 Alex Rodriguez 110.30
Greg Maddux 113.66 Randy Johnson 109.77
Albert Pujols 111.86 Ty Cobb 108.77
Randy Johnson 110.81 Nolan Ryan 108.30
Mike Schmidt 109.58 Ted Williams 107.75
Rickey Henderson 109.08 Mike Schmidt 106.41
Ted Williams 107.86 Rickey Henderson 103.90
Tom Seaver 104.31 Bert Blyleven 101.82
Lefty Grove 102.54 Steve Carlton 100.34
Tris Speaker 102.26 Lefty Grove 98.80
Justin Verlander 100.23 Albert Pujols 97.34
Joe Morgan 100.17 Joe Morgan 96.07
Frank Robinson 99.93 Frank Robinson 95.92
Mel Ott 99.74 Mel Ott 95.72
Bert Blyleven 97.69 Tris Speaker 95.13
Cal Ripken Jr 97.39 Justin Verlander 95.07
Rogers Hornsby 97.01 Gaylord Perry 94.45
Lou Gehrig 95.87 Rogers Hornsby 94.42

Sensitivity Analysis for the condition where some talented potential baseball players fail to start their sports career in baseball.

In this section, we perform a sensitivity analysis for the condition where some talented potential baseball players fail to start their sports career in baseball. For example, Kyler Murray and Pat Mahomes play in the NFL, but some argue that they are also two of the most talented potential baseball players. Competition from other sports is fierce at the upper end of the talent pool, where multiple sport opportunities are common.

In this sensitivity analysis, we assume that the 10th, 20th, …, 100th most talented potential baseball players fail to start their sports career in baseball. This means that the player with the 10th largest bWAR or fWAR is paired with the 11th largest talent, the player with the 20th largest bWAR or fWAR is paired with the 22nd largest talent, and so on. We then map their talents into the common mapping environment we built before and compute the era-adjusted bWAR and fWAR. We perform this analysis for the seasons after the 1950 season to reflect the effect of baseball integration. We also perform this analysis for the seasons after the 1994 season to reflect the effect of the MLB strike.
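The re-pairing rule can be made concrete with a short sketch. This is an illustration of the ranking arithmetic only, not the report's implementation; the function `talent_rank_for_player` and its arguments are hypothetical names.

```r
# Talent ranks assumed to be lost to other sports
removed <- seq(10, 100, by = 10)

# For observed players ranked 1..n_players by bWAR or fWAR, return the rank of
# the talent each player is paired with once the removed ranks are skipped
talent_rank_for_player <- function(n_players, removed) {
  available <- setdiff(seq_len(n_players + length(removed)), removed)
  available[seq_len(n_players)]
}

pairing <- talent_rank_for_player(30, removed)
pairing[c(9, 10, 20)]
```

Players ranked 1 through 9 keep their own talent rank, the 10th player moves to the 11th talent, and the 20th player moves to the 22nd talent, matching the description above.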

The tables below are the top 25 ebWAR and efWAR for batters and pitchers combined with respect to this sensitivity analysis.

without rm ebWAR rm after 1950 ebWAR rm after 1994 ebWAR
Barry Bonds 153.89 Barry Bonds 153.89 Barry Bonds 153.89
Roger Clemens 145.88 Roger Clemens 145.88 Roger Clemens 145.88
Willie Mays 144.08 Willie Mays 144.08 Willie Mays 144.08
Babe Ruth 137.98 Babe Ruth 137.98 Babe Ruth 137.98
Henry Aaron 135.60 Henry Aaron 135.60 Henry Aaron 135.60
Alex Rodriguez 120.29 Alex Rodriguez 120.29 Alex Rodriguez 120.29
Stan Musial 119.51 Stan Musial 119.51 Stan Musial 119.51
Ty Cobb 114.48 Ty Cobb 114.48 Ty Cobb 114.48
Greg Maddux 113.66 Greg Maddux 113.66 Greg Maddux 113.66
Albert Pujols 111.86 Albert Pujols 111.86 Albert Pujols 111.86
Randy Johnson 110.81 Randy Johnson 110.81 Randy Johnson 110.81
Mike Schmidt 109.58 Mike Schmidt 109.58 Mike Schmidt 109.58
Rickey Henderson 109.08 Rickey Henderson 109.08 Rickey Henderson 109.08
Ted Williams 107.86 Ted Williams 107.86 Ted Williams 107.86
Tom Seaver 104.31 Tom Seaver 104.31 Tom Seaver 104.31
Lefty Grove 102.54 Lefty Grove 102.54 Lefty Grove 102.54
Tris Speaker 102.26 Tris Speaker 102.26 Tris Speaker 102.26
Justin Verlander 100.23 Justin Verlander 100.23 Justin Verlander 100.23
Joe Morgan 100.17 Joe Morgan 100.17 Joe Morgan 100.17
Frank Robinson 99.93 Frank Robinson 99.93 Frank Robinson 99.93
Mel Ott 99.74 Mel Ott 99.74 Mel Ott 99.74
Bert Blyleven 97.69 Bert Blyleven 97.69 Bert Blyleven 97.69
Cal Ripken Jr 97.39 Cal Ripken Jr 97.39 Cal Ripken Jr 97.39
Rogers Hornsby 97.01 Rogers Hornsby 97.01 Rogers Hornsby 97.01
Lou Gehrig 95.87 Lou Gehrig 95.87 Lou Gehrig 95.87
without rm efWAR rm after 1950 efWAR rm after 1994 efWAR
Barry Bonds 145.24 Barry Bonds 145.24 Barry Bonds 145.24
Roger Clemens 141.25 Roger Clemens 141.25 Roger Clemens 141.25
Willie Mays 135.39 Willie Mays 135.39 Willie Mays 135.39
Henry Aaron 128.05 Henry Aaron 128.05 Henry Aaron 128.05
Greg Maddux 120.73 Greg Maddux 120.73 Greg Maddux 120.73
Babe Ruth 120.28 Babe Ruth 120.28 Babe Ruth 120.28
Stan Musial 113.03 Stan Musial 113.03 Stan Musial 113.03
Alex Rodriguez 110.30 Alex Rodriguez 110.30 Alex Rodriguez 110.30
Randy Johnson 109.77 Randy Johnson 109.77 Randy Johnson 109.77
Ty Cobb 108.77 Ty Cobb 108.77 Ty Cobb 108.77
Nolan Ryan 108.30 Nolan Ryan 108.30 Nolan Ryan 108.30
Ted Williams 107.75 Ted Williams 107.75 Ted Williams 107.75
Mike Schmidt 106.41 Mike Schmidt 106.41 Mike Schmidt 106.41
Rickey Henderson 103.90 Rickey Henderson 103.90 Rickey Henderson 103.90
Bert Blyleven 101.82 Bert Blyleven 101.82 Bert Blyleven 101.82
Steve Carlton 100.34 Steve Carlton 100.34 Steve Carlton 100.34
Lefty Grove 98.80 Lefty Grove 98.80 Lefty Grove 98.80
Albert Pujols 97.34 Albert Pujols 97.34 Albert Pujols 97.34
Joe Morgan 96.07 Joe Morgan 96.07 Joe Morgan 96.07
Frank Robinson 95.92 Frank Robinson 95.92 Frank Robinson 95.92
Mel Ott 95.72 Mel Ott 95.72 Mel Ott 95.72
Tris Speaker 95.13 Tris Speaker 95.13 Tris Speaker 95.13
Justin Verlander 95.07 Justin Verlander 95.07 Justin Verlander 95.07
Gaylord Perry 94.45 Gaylord Perry 94.45 Gaylord Perry 94.45
Rogers Hornsby 94.42 Rogers Hornsby 94.42 Rogers Hornsby 94.42

The tables below are the top 25 ebJAWS and efJAWS for batters and pitchers combined with respect to this sensitivity analysis.

without rm ebJAWS rm after 1950 ebJAWS rm after 1994 ebJAWS
Barry Bonds 109.14 Barry Bonds 108.86 Barry Bonds 109.01
Roger Clemens 107.47 Roger Clemens 106.80 Roger Clemens 107.03
Willie Mays 105.16 Willie Mays 104.96 Willie Mays 105.32
Babe Ruth 100.70 Babe Ruth 100.70 Babe Ruth 100.70
Henry Aaron 95.11 Henry Aaron 94.38 Henry Aaron 95.03
Alex Rodriguez 91.66 Alex Rodriguez 91.01 Alex Rodriguez 91.01
Stan Musial 88.38 Stan Musial 87.15 Stan Musial 88.24
Randy Johnson 88.20 Randy Johnson 87.10 Randy Johnson 87.78
Albert Pujols 86.33 Lefty Grove 84.52 Greg Maddux 84.78
Greg Maddux 85.60 Mike Schmidt 84.25 Mike Schmidt 84.53
Mike Schmidt 84.52 Albert Pujols 84.14 Lefty Grove 84.52
Lefty Grove 84.52 Greg Maddux 83.77 Albert Pujols 84.14
Ted Williams 83.54 Ted Williams 83.36 Ted Williams 83.59
Ty Cobb 82.62 Ty Cobb 82.60 Ty Cobb 82.60
Rickey Henderson 82.14 Rickey Henderson 80.89 Rickey Henderson 81.05
Justin Verlander 80.19 Joe Morgan 78.01 Joe Morgan 79.04
Joe Morgan 79.03 Tom Seaver 77.36 Tom Seaver 78.49
Tom Seaver 78.50 Rogers Hornsby 76.61 Cal Ripken Jr 76.85
Cal Ripken Jr 77.54 Cal Ripken Jr 75.98 Rogers Hornsby 76.61
Mike Trout 77.26 Lou Gehrig 75.70 Lou Gehrig 75.70
Rogers Hornsby 76.61 Wade Boggs 74.72 Mickey Mantle 75.11
Lou Gehrig 75.70 Tris Speaker 74.67 Bert Blyleven 75.01
Wade Boggs 75.61 Mickey Mantle 74.50 Wade Boggs 75.00
Clayton Kershaw 75.10 Mel Ott 74.19 Tris Speaker 74.67
Mickey Mantle 75.09 Justin Verlander 73.64 Mel Ott 74.19
without rm efJAWS rm after 1950 efJAWS rm after 1994 efJAWS
Roger Clemens 103.54 Roger Clemens 103.11 Roger Clemens 103.26
Barry Bonds 103.21 Barry Bonds 102.94 Barry Bonds 103.06
Willie Mays 98.72 Willie Mays 98.49 Willie Mays 98.84
Babe Ruth 90.37 Babe Ruth 90.37 Babe Ruth 90.37
Henry Aaron 89.52 Henry Aaron 88.70 Henry Aaron 89.45
Greg Maddux 88.51 Greg Maddux 87.72 Greg Maddux 88.14
Randy Johnson 85.78 Randy Johnson 85.08 Randy Johnson 85.52
Alex Rodriguez 84.35 Alex Rodriguez 83.77 Alex Rodriguez 83.77
Stan Musial 83.61 Stan Musial 82.43 Stan Musial 83.46
Ted Williams 82.72 Ted Williams 82.41 Ted Williams 82.78
Mike Schmidt 82.20 Mike Schmidt 81.85 Mike Schmidt 82.21
Lefty Grove 79.31 Lefty Grove 79.31 Lefty Grove 79.31
Ty Cobb 78.85 Ty Cobb 78.83 Ty Cobb 78.83
Rickey Henderson 78.83 Rickey Henderson 77.38 Steve Carlton 78.32
Steve Carlton 78.32 Steve Carlton 77.22 Rickey Henderson 77.72
Nolan Ryan 76.91 Nolan Ryan 75.67 Nolan Ryan 76.94
Albert Pujols 75.80 Bert Blyleven 74.31 Joe Morgan 75.14
Joe Morgan 75.14 Rogers Hornsby 74.27 Bert Blyleven 75.12
Bert Blyleven 75.12 Joe Morgan 73.99 Rogers Hornsby 74.27
Justin Verlander 74.86 Albert Pujols 73.53 Mickey Mantle 73.72
Rogers Hornsby 74.27 Lou Gehrig 73.41 Albert Pujols 73.53
Mike Trout 73.90 Mickey Mantle 73.19 Lou Gehrig 73.41
Mickey Mantle 73.71 Cal Ripken Jr 72.06 Cal Ripken Jr 72.77
Cal Ripken Jr 73.62 Walter Johnson 72.00 Walter Johnson 72.00
Lou Gehrig 73.41 Mel Ott 71.10 Mel Ott 71.10

Multiverse Analysis

Batting Average

We test four factors in our Full House Model, using batting average as the illustration. These four factors are the park-factor effect, population change, the component distribution, and the season-size effect. The table shows the era-adjusted BA of Tony Gwynn in the 1997 season minus the era-adjusted BA of Ty Cobb in the 1911 season under different configurations. The PF column indicates whether we apply the park-factor adjustment to BA: YES means we apply it, and NO means we do not. The pops column indicates the population change we apply to the talent pool: 0.5_favor assumes 50% favorite sport, which is the talent pool we use; 0.75_favor assumes 75% favorite sport; 1_favor assumes 100% favorite sport; constant assumes the talent pool did not change over time and sets it to 1 million; erosion accounts for minor league erosion. Details on how we estimate the talent pool can be found in the tech report. The para column indicates whether we use a parametric distribution (para) or a nonparametric distribution (nonpara) to measure BA in each season. The league column indicates which of two league sizes we use as the number of components in each season: historical uses the historical league size, and fixed uses the maximum number of components over all seasons as the fixed number of components in every season. The diff column shows the era-adjusted BA of Tony Gwynn in the 1997 season minus the era-adjusted BA of Ty Cobb in the 1911 season.

Table 5: The value of the era-adjusted BA of Tony Gwynn in the 1997 season minus the era-adjusted BA of Ty Cobb in the 1911 season under different configurations in the supplementary materials

PF pops para league diff
YES 0.5_favor para historical 0.020
YES 0.5_favor para fixed 0.008
YES 0.5_favor nonpara historical 0.040
YES 0.5_favor nonpara fixed 0.040
YES 0.75_favor para historical 0.017
YES 0.75_favor para fixed 0.006
YES 0.75_favor nonpara historical 0.035
YES 0.75_favor nonpara fixed 0.035
YES 1_favor para historical 0.014
YES 1_favor para fixed 0.003
YES 1_favor nonpara historical 0.023
YES 1_favor nonpara fixed 0.023
YES constant para historical 0.006
YES constant para fixed -0.005
YES constant nonpara historical -0.002
YES constant nonpara fixed -0.002
YES erosion para historical 0.010
YES erosion para fixed -0.001
YES erosion nonpara historical 0.006
YES erosion nonpara fixed 0.006
NO 0.5_favor para historical 0.005
NO 0.5_favor para fixed 0.002
NO 0.5_favor nonpara historical 0.029
NO 0.5_favor nonpara fixed 0.029
NO 0.75_favor para historical 0.003
NO 0.75_favor para fixed -0.001
NO 0.75_favor nonpara historical 0.027
NO 0.75_favor nonpara fixed 0.027
NO 1_favor para historical 0.000
NO 1_favor para fixed -0.003
NO 1_favor nonpara historical 0.015
NO 1_favor nonpara fixed 0.015
NO constant para historical -0.007
NO constant para fixed -0.010
NO constant nonpara historical -0.003
NO constant nonpara fixed -0.003
NO erosion para historical -0.004
NO erosion para fixed -0.008
NO erosion nonpara historical 0.007
NO erosion nonpara fixed 0.007
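The 40 rows of the table above correspond to every combination of the four factors. A minimal sketch of how such a configuration grid can be enumerated with base R is shown here; the object name `configs` is illustrative, and the model fit that would fill in the diff column is not reproduced.

```r
# expand.grid varies its first argument fastest, so the factors are listed from
# fastest-changing (league) to slowest-changing (PF) and then reordered
configs <- expand.grid(
  league = c("historical", "fixed"),
  para   = c("para", "nonpara"),
  pops   = c("0.5_favor", "0.75_favor", "1_favor", "constant", "erosion"),
  PF     = c("YES", "NO"),
  stringsAsFactors = FALSE
)[, c("PF", "pops", "para", "league")]

nrow(configs)     # 2 * 5 * 2 * 2 configurations
head(configs, 4)  # same row order as the table
```

Looping over the rows of `configs` and recording one diff per row yields a table with the layout shown above.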

bWAR

We also use bWAR to test the effect of population change, season size, and the trimming method in our Full House Model. The table shows the career bWAR of Willie Mays minus the career bWAR of Babe Ruth under different configurations, before and after applying the trimming method. The pops column indicates the population change we apply to the talent pool: 0.5_favor assumes 50% favorite sport, which is the talent pool we use; 0.75_favor assumes 75% favorite sport; 1_favor assumes 100% favorite sport; constant assumes the talent pool did not change over time and sets it to 1 million; erosion accounts for minor league erosion. Details on how we estimate the talent pool can be found in the tech report. The league column indicates which of two league sizes we use as the number of components in each season: historical uses the historical league size, and fixed uses the maximum number of components over all seasons as the fixed number of components in every season. The diff_after column gives the career bWAR of Willie Mays minus the career bWAR of Babe Ruth after applying the trimming method, and the diff_before column gives the same difference before applying the trimming method.

pops league diff_after diff_before
0.5_favor historical 6.22 11.50
0.5_favor fixed 6.21 11.52
0.75_favor historical 1.37 6.57
0.75_favor fixed 1.35 6.61
1_favor historical -2.76 -0.98
1_favor fixed -2.82 -0.78
constant historical -20.95 -21.34
constant fixed -20.97 -22.01
erosion historical -16.68 -18.93
erosion fixed -18.97 -19.63

Table 6: Era-adjusted JAWS (eJAWS) rankings computed with respect to talent pool estimates A-E in the manuscript

name eJAWS name eJAWS name eJAWS name eJAWS name eJAWS
Barry Bonds 106.17 Barry Bonds 106.47 Willie Mays 106.10 Babe Ruth 112.24 Babe Ruth 117.14
Roger Clemens 105.50 Roger Clemens 104.84 Babe Ruth 105.83 Barry Bonds 106.50 Ty Cobb 108.87
Willie Mays 101.94 Willie Mays 104.02 Barry Bonds 105.22 Roger Clemens 104.36 Willie Mays 107.52
Babe Ruth 95.50 Babe Ruth 100.28 Roger Clemens 103.35 Willie Mays 104.19 Barry Bonds 106.98
Henry Aaron 92.32 Henry Aaron 94.03 Henry Aaron 97.05 Ty Cobb 100.50 Cy Young 106.41
Alex Rodriguez 88.00 Stan Musial 88.33 Ty Cobb 92.31 Walter Johnson 95.30 Roger Clemens 105.75
Greg Maddux 87.06 Alex Rodriguez 87.46 Stan Musial 91.79 Henry Aaron 93.98 Cap Anson 105.38
Randy Johnson 86.99 Greg Maddux 86.28 Lefty Grove 89.55 Stan Musial 93.94 Walter Johnson 104.81
Stan Musial 86.00 Randy Johnson 86.07 Ted Williams 87.53 Lefty Grove 92.66 Honus Wagner 101.93
Mike Schmidt 83.36 Ty Cobb 86.04 Alex Rodriguez 85.86 Honus Wagner 91.14 Henry Aaron 98.54
Ted Williams 83.13 Ted Williams 85.27 Walter Johnson 85.74 Tris Speaker 90.48 Stan Musial 98.13
Lefty Grove 81.91 Lefty Grove 85.18 Greg Maddux 84.66 Cy Young 89.53 Tris Speaker 97.73
Albert Pujols 81.06 Mike Schmidt 83.93 Mike Schmidt 84.41 Ted Williams 89.32 Lefty Grove 93.98
Ty Cobb 80.74 Rickey Henderson 79.87 Randy Johnson 84.18 Rogers Hornsby 88.20 Eddie Collins 93.82
Rickey Henderson 80.48 Albert Pujols 79.37 Rogers Hornsby 83.75 Alex Rodriguez 87.94 Ted Williams 92.09
Justin Verlander 77.82 Rogers Hornsby 79.16 Tris Speaker 82.72 Greg Maddux 86.59 Rogers Hornsby 92.03
Joe Morgan 77.09 Joe Morgan 78.39 Lou Gehrig 81.15 Randy Johnson 85.56 Nap Lajoie 88.60
Cal Ripken Jr 75.58 Walter Johnson 77.96 Joe Morgan 79.85 Eddie Collins 85.07 Greg Maddux 87.34
Mike Trout 75.58 Tris Speaker 77.37 Mel Ott 79.75 Mel Ott 84.97 Randy Johnson 86.86
Rogers Hornsby 75.44 Lou Gehrig 77.32 Honus Wagner 79.60 Lou Gehrig 84.76 Mel Ott 86.76
Bert Blyleven 75.06 Justin Verlander 76.57 Rickey Henderson 78.68 Mike Schmidt 83.19 Lou Gehrig 86.26
Lou Gehrig 74.56 Mickey Mantle 76.25 Mickey Mantle 78.68 Cap Anson 81.33 Alex Rodriguez 85.12
Mickey Mantle 74.40 Mike Trout 75.43 Tom Seaver 76.67 Albert Pujols 81.03 Pete Alexander 83.86
Steve Carlton 74.40 Bert Blyleven 75.43 Steve Carlton 76.54 Rickey Henderson 80.28 Mike Schmidt 83.62
Tom Seaver 74.15 Steve Carlton 75.33 Frank Robinson 76.46 Jimmie Foxx 78.52 Mickey Mantle 83.50

Multiverse Analysis: Home Run

We test three factors in our Full House Model, using home runs as the illustration. These three factors are the park-factor effect, population change, and the season-size effect. The table shows the era-adjusted HR of Barry Bonds in the 2001 season minus the era-adjusted HR of Babe Ruth in the 1920 season under different configurations. The PF column indicates whether we apply the park-factor adjustment to HR: YES means we apply it, and NO means we do not. The pops column indicates the population change we apply to the talent pool: 0.5_favor assumes 50% favorite sport, which is the talent pool we use; 0.75_favor assumes 75% favorite sport; 1_favor assumes 100% favorite sport; constant assumes the talent pool did not change over time and sets it to 1 million; erosion accounts for minor league erosion. Details on how we estimate the talent pool can be found in the tech report. The league column indicates which of two league sizes we use as the number of components in each season: historical uses the historical league size, and fixed uses the maximum number of components over all seasons as the fixed number of components in every season. The diff column shows the era-adjusted HR of Barry Bonds in the 2001 season minus the era-adjusted HR of Babe Ruth in the 1920 season.

Table 6: The value of the era-adjusted AB per HR of Barry Bonds in the 2001 season minus the era-adjusted AB per HR of Babe Ruth in the 1920 season under different configurations in the supplementary materials

PF pops league diff
YES 0.5_favor historical -0.0366554
YES 0.5_favor fixed -0.0366529
YES 0.75_favor historical -0.0115940
YES 0.75_favor fixed -0.0115925
YES 1_favor historical 0.0066636
YES 1_favor fixed 0.0066644
YES erosion historical 0.0196551
YES erosion fixed 0.0196557
YES constant historical 0.0284328
YES constant fixed 0.0284331
NO 0.5_favor historical -0.0099727
NO 0.5_favor fixed -0.0099658
NO 0.75_favor historical 0.0378487
NO 0.75_favor fixed 0.0378545
NO 1_favor historical 0.0841428
NO 1_favor fixed 0.0841461
NO erosion historical 0.1004012
NO erosion fixed 0.1004028
NO constant historical 0.0995427
NO constant fixed 0.0995433

Table 7: Top-25 eJAWS rankings after removing people from the talent pool in the manuscript

w/o rm eJAWS rm 1950 eJAWS rm 1994 eJAWS
Barry Bonds 106.17 Barry Bonds 105.90 Barry Bonds 106.03
Roger Clemens 105.50 Roger Clemens 104.96 Roger Clemens 105.15
Willie Mays 101.94 Willie Mays 101.72 Willie Mays 102.08
Babe Ruth 95.50 Babe Ruth 95.50 Babe Ruth 95.50
Henry Aaron 92.32 Henry Aaron 91.54 Henry Aaron 92.24
Alex Rodriguez 88.00 Alex Rodriguez 87.39 Alex Rodriguez 87.39
Greg Maddux 87.06 Randy Johnson 86.09 Randy Johnson 86.65
Randy Johnson 86.99 Greg Maddux 85.75 Greg Maddux 86.46
Stan Musial 86.00 Stan Musial 84.79 Stan Musial 85.85
Mike Schmidt 83.36 Mike Schmidt 83.05 Mike Schmidt 83.37
Ted Williams 83.13 Ted Williams 82.88 Ted Williams 83.18
Lefty Grove 81.91 Lefty Grove 81.91 Lefty Grove 81.91
Albert Pujols 81.06 Ty Cobb 80.72 Ty Cobb 80.72
Ty Cobb 80.74 Rickey Henderson 79.13 Rickey Henderson 79.38
Rickey Henderson 80.48 Albert Pujols 78.84 Albert Pujols 78.84
Justin Verlander 77.82 Joe Morgan 76.00 Joe Morgan 77.09
Joe Morgan 77.09 Rogers Hornsby 75.44 Rogers Hornsby 75.44
Cal Ripken Jr 75.58 Lou Gehrig 74.56 Bert Blyleven 75.06
Mike Trout 75.58 Cal Ripken Jr 74.02 Cal Ripken Jr 74.81
Rogers Hornsby 75.44 Mickey Mantle 73.84 Lou Gehrig 74.56
Bert Blyleven 75.06 Bert Blyleven 73.78 Mickey Mantle 74.41
Lou Gehrig 74.56 Tom Seaver 73.05 Steve Carlton 74.40
Mickey Mantle 74.40 Steve Carlton 72.97 Tom Seaver 74.13
Steve Carlton 74.40 Wade Boggs 72.65 Wade Boggs 73.05
Tom Seaver 74.15 Mel Ott 72.64 Mel Ott 72.64

Figure 5: Estimated WAR of a 2-WAR player in 2023 in the manuscript

Figure 6: Tail probability for outlying Ruth and Bonds seasons in the manuscript

Table 4: The bWAR rankings from Full House Model using different talent generating function in the Supplementary materials.

In this section, we use different talent-generating processes to verify the robustness of the model.

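The table below shows the top 25 players by era-adjusted bWAR under four different talent-generating processes: talent follows a standard normal distribution, a folded normal distribution, a Pareto distribution with \(\alpha = 3\), or a Pareto distribution with \(\alpha = 1.16\). The first three processes produce identical ranking lists, and the list under the Pareto distribution with \(\alpha = 1.16\) differs in only a few positions (for example, Albert Pujols ranks 10th rather than 12th, and Justin Verlander enters the list at 18th). We therefore conclude that our Full House Model is fairly robust to the choice of talent-generating process.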

rank standard normal era-adjusted bWAR Folded normal (mu = 0, sigma = 1) era-adjusted bWAR Pareto with alpha = 3 era-adjusted bWAR Pareto with alpha = 1.16 era-adjusted bWAR
1 Barry Bonds 153.93482003704 Barry Bonds 153.934820057079 Barry Bonds 153.93482003704 Barry Bonds 153.900596672323
2 Roger Clemens 145.907695934742 Roger Clemens 145.907695925218 Roger Clemens 145.907695934742 Roger Clemens 145.907695934742
3 Willie Mays 144.20526296462 Willie Mays 144.205262974907 Willie Mays 144.20526296462 Willie Mays 144.095468112426
4 Henry Aaron 135.560701275838 Henry Aaron 135.560701289904 Henry Aaron 135.560701275838 Henry Aaron 135.601861141104
5 Babe Ruth 132.702324888145 Babe Ruth 132.702324893792 Babe Ruth 132.702324888145 Babe Ruth 132.702324888145
6 Stan Musial 119.25776183457 Stan Musial 119.257761829684 Stan Musial 119.25776183457 Stan Musial 119.509994291424
7 Alex Rodriguez 119.05928887207 Alex Rodriguez 119.059288892948 Alex Rodriguez 119.05928887207 Alex Rodriguez 119.066070765468
8 Greg Maddux 113.670801140907 Greg Maddux 113.670801146666 Greg Maddux 113.670801140907 Greg Maddux 113.670801140907
9 Ty Cobb 112.000334920716 Ty Cobb 112.000334926544 Ty Cobb 112.000334920716 Ty Cobb 112.003670261108
10 Randy Johnson 110.812019546253 Randy Johnson 110.812019560192 Randy Johnson 110.812019546253 Albert Pujols 111.852368250334
11 Mike Schmidt 109.61435217199 Mike Schmidt 109.614352185456 Mike Schmidt 109.61435217199 Randy Johnson 110.812019546253
12 Albert Pujols 109.147332313036 Albert Pujols 109.147332303033 Albert Pujols 109.147332313036 Mike Schmidt 109.591837967447
13 Rickey Henderson 109.137893304437 Rickey Henderson 109.137893313238 Rickey Henderson 109.137893304437 Rickey Henderson 109.063644359715
14 Ted Williams 108.155500712145 Ted Williams 108.155500712963 Ted Williams 108.155500712145 Ted Williams 108.037777234477
15 Tom Seaver 104.28306193596 Tom Seaver 104.283061916776 Tom Seaver 104.28306193596 Tom Seaver 104.305946556062
16 Lefty Grove 101.582214503872 Lefty Grove 101.582214523742 Lefty Grove 101.582214503872 Lefty Grove 101.582214503872
17 Tris Speaker 100.72466821009 Tris Speaker 100.72466820624 Tris Speaker 100.72466821009 Tris Speaker 100.699554124141
18 Joe Morgan 100.182322428416 Joe Morgan 100.182322426815 Joe Morgan 100.182322428416 Justin Verlander 100.240996144687
19 Frank Robinson 100.036273382032 Frank Robinson 100.036273385659 Frank Robinson 100.036273382032 Joe Morgan 100.168379881699
20 Bert Blyleven 97.6905454153586 Bert Blyleven 97.6905454263732 Bert Blyleven 97.6905454153586 Frank Robinson 99.9106587464703
21 Cal Ripken Jr 97.4212949564655 Cal Ripken Jr 97.4212949308992 Cal Ripken Jr 97.4212949564655 Bert Blyleven 97.6905454153586
22 Mel Ott 96.690996783323 Mel Ott 96.6909967816042 Mel Ott 96.690996783323 Cal Ripken Jr 97.4130854322836
23 Rogers Hornsby 95.7583931883965 Rogers Hornsby 95.7583931873413 Rogers Hornsby 95.7583931883965 Mel Ott 96.716110869205
24 Lou Gehrig 95.6758386877758 Lou Gehrig 95.675838706285 Lou Gehrig 95.6758386877758 Rogers Hornsby 95.7583931883965
25 Mickey Mantle 95.4051563355476 Mickey Mantle 95.4051563322911 Mickey Mantle 95.4051563355476 Lou Gehrig 95.6758386877758

Figure 3: The relationship between bWAR talent and bWAR per game in the supplementary materials

Figure 4: The pairings of the maximum talent scores with their corresponding era-adjusted bWAR for players in the common mapping environment in the supplementary materials

Table 7: Whether or not the distribution of the batting statistics is parametric or nonparametric affects the top 25 career batting averages and top 25 four-year peaks by batting average.

name BA name BA name year BA name year BA
Tony Gwynn 0.342 Tony Gwynn 0.338 Jose Altuve 2014-2017 0.367 Tony Gwynn 1994-1997 0.360
Rod Carew 0.329 Ty Cobb 0.332 Tony Gwynn 1994-1997 0.366 Ty Cobb 1916-1919 0.353
Jose Altuve 0.327 Rod Carew 0.324 Rod Carew 1974-1977 0.363 Wade Boggs 1985-1988 0.351
Ichiro Suzuki 0.327 Ichiro Suzuki 0.322 Miguel Cabrera 2010-2013 0.355 Rod Carew 1974-1977 0.350
Miguel Cabrera 0.320 Jose Altuve 0.320 Wade Boggs 1985-1988 0.353 Ichiro Suzuki 2001-2004 0.348
Roberto Clemente 0.320 Roberto Clemente 0.318 Ichiro Suzuki 2001-2004 0.353 Rogers Hornsby 1921-1924 0.346
Ty Cobb 0.320 Joe DiMaggio 0.318 Barry Bonds 2001-2004 0.352 Jose Altuve 2014-2017 0.345
Joe DiMaggio 0.318 Shoeless Joe Jackson 0.316 Joe Mauer 2006-2009 0.350 Mike Piazza 1995-1998 0.345
Wade Boggs 0.316 Wade Boggs 0.314 Roberto Clemente 1964-1967 0.345 Barry Bonds 2001-2004 0.344
Buster Posey 0.316 Freddie Freeman 0.314 Joe DiMaggio 1938-1941 0.345 Joe DiMaggio 1939-1942 0.343
Mike Trout 0.315 Stan Musial 0.314 Albert Pujols 2003-2006 0.343 Don Mattingly 1984-1987 0.340
Freddie Freeman 0.314 Ted Williams 0.314 Don Mattingly 1984-1987 0.341 Henry Aaron 1956-1959 0.339
Joe Mauer 0.314 Henry Aaron 0.313 Mike Piazza 1995-1998 0.341 Roberto Clemente 1969-1972 0.338
Ted Williams 0.314 Buster Posey 0.313 Willie Mays 1957-1960 0.340 Stan Musial 1943-1946 0.338
Stan Musial 0.313 Mike Trout 0.312 Matty Alou 1966-1969 0.339 Joe Mauer 2006-2009 0.337
Willie Mays 0.312 Matty Alou 0.311 Tim Anderson 2019-2022 0.338 Miguel Cabrera 2010-2013 0.336
Bill Terry 0.312 Miguel Cabrera 0.311 Stan Musial 1943-1946 0.338 Nap Lajoie 1901-1904 0.336
Robinson Canó 0.311 Robinson Canó 0.311 Rogers Hornsby 1922-1925 0.335 Albert Pujols 2003-2006 0.336
Henry Aaron 0.310 Vladimir Guerrero 0.311 Ted Williams 1943-1946 0.335 Matty Alou 1966-1969 0.335
Matty Alou 0.310 Joe Mauer 0.311 Ty Cobb 1912-1915 0.334 Honus Wagner 1905-1908 0.335
Vladimir Guerrero 0.310 Rogers Hornsby 0.310 Trea Turner 2019-2022 0.334 Tris Speaker 1913-1916 0.334
Derek Jeter 0.310 Willie Mays 0.310 Henry Aaron 1956-1959 0.333 Ted Williams 1943-1946 0.334
Al Oliver 0.310 Kirby Puckett 0.310 Cecil Cooper 1980-1983 0.333 Freddie Freeman 2020-2023 0.332
Lou Gehrig 0.309 Bill Terry 0.310 Freddie Freeman 2020-2023 0.333 Lou Gehrig 1932-1935 0.332
Edgar Martinez 0.309 Lou Gehrig 0.309 Nap Lajoie 1901-1904 0.333 Willie Mays 1957-1960 0.332

Figure 1: Histogram of the p-values from Shapiro-Wilk test of normality on the BA distribution in each season in the Supplementary materials.